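I am using Pydantic to create a time-series model based on pandas Timestamp (start, end) and Timedelta (period) objects. The model will be used in a small data analysis program that has several configurations/scenarios.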
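I need to instantiate the Timeseries model and validate various aspects of it based on two bool (include_end_period, allow_future) and one optional int (max_periods) configuration parameters. I then need to derive three new fields (timezone, total_duration, total_periods) and perform some additional validation.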
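Because validating one value often requires several of the other values, I could not get the desired result with the typical @validator methods. In particular, I would often get a KeyError for a missing value rather than the expected ValueError. The best solution I have found is to create one long @root_validator(pre=True) method.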
from pydantic import BaseModel, ValidationError, root_validator, conint
from pandas import Timestamp, Timedelta

class Timeseries(BaseModel):
    start: Timestamp
    end: Timestamp
    period: Timedelta
    include_end_period: bool = False
    allow_future: bool = True
    max_periods: conint(gt=0, strict=True) | None = None

    # Derived values, do not pass as params
    timezone: str | None
    total_duration: Timedelta
    total_periods: conint(gt=0, strict=True)

    class Config:
        extra = 'forbid'
        validate_assignment = True

    @root_validator(pre=True)
    def _validate_model(cls, values):
        # Validate input values
        if values['start'] > values['end']:
            raise ValueError('Start timestamp cannot be later than end')
        if values['start'].tzinfo != values['end'].tzinfo:
            raise ValueError('Start, end timezones do not match')
        if values['period'] <= Timedelta(0):
            raise ValueError('Period must be a positive amount of time')

        # Set timezone
        timezone = values['start'].tzname()
        if 'timezone' in values and values['timezone'] != timezone:
            raise ValueError('Timezone param does not match start timezone')
        values['timezone'] = timezone

        # Set duration (add 1 period if including end period)
        total_duration = values['end'] - values['start']
        if values['include_end_period']:
            total_duration += values['period']
        if 'total_duration' in values and values['total_duration'] != total_duration:
            error_context = ' + 1 period (included end period)' if values['include_end_period'] else ''
            raise ValueError(f'Duration param does not match end - start timestamps{error_context}')
        values['total_duration'] = total_duration

        # Set total_periods
        periods_float: float = values['total_duration'] / values['period']
        if periods_float != int(periods_float):
            raise ValueError('Total duration not divisible by period length')
        total_periods = int(periods_float)
        if 'total_periods' in values and values['total_periods'] != total_periods:
            raise ValueError('Total periods param does not match')
        values['total_periods'] = total_periods

        # Validate future
        if not values['allow_future']:
            # Get current timestamp to floor of period (subtract 1 period if including end period)
            max_end: Timestamp = Timestamp.now(tz=values['timezone']).floor(freq=values['period'])
            if values['include_end_period']:
                max_end -= values['period']
            if values['end'] > max_end:
                raise ValueError('End period is future or current (incomplete)')

        # Validate derived values
        if values['total_duration'] < Timedelta(0):
            raise ValueError('Total duration must be positive amount of time')
        if values['max_periods'] and values['total_periods'] > values['max_periods']:
            raise ValueError('Total periods exceeds max periods param')

        return values
Instantiating the model in a passing case, with all of the configuration checks in use:
start = Timestamp('2023-03-01T00:00:00Z')
end = Timestamp('2023-03-02T00:00:00Z')
period = Timedelta('5min')

try:
    ts = Timeseries(start=start, end=end, period=period,
                    include_end_period=True, allow_future=False, max_periods=10000)
    print(ts.dict())
except ValidationError as e:
    print(e)
Output:
"""
{'start': Timestamp('2023-03-01 00:00:00+0000', tz='UTC'),
'end': Timestamp('2023-03-02 00:00:00+0000', tz='UTC'),
'period': Timedelta('0 days 00:05:00'),
'include_end_period': True,
'allow_future': False,
'max_periods': 10000,
'timezone': 'UTC',
'total_duration': Timedelta('1 days 00:05:00'),
'total_periods': 289}
"""
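As a sanity check on the derived values in that output, the same arithmetic can be reproduced with only the standard library. This is a minimal sketch that uses datetime and timedelta as stand-ins for the pandas types; it is not part of the model itself.

```python
from datetime import datetime, timedelta, timezone

# Stand-ins for the pandas Timestamp/Timedelta inputs used above
start = datetime(2023, 3, 1, tzinfo=timezone.utc)
end = datetime(2023, 3, 2, tzinfo=timezone.utc)
period = timedelta(minutes=5)
include_end_period = True

# Duration is end - start, plus one period when the end period is included
total_duration = end - start
if include_end_period:
    total_duration += period

# Dividing one timedelta by another yields a float number of periods
periods_float = total_duration / period
if periods_float != int(periods_float):
    raise ValueError('Total duration not divisible by period length')
total_periods = int(periods_float)

print(total_duration)  # 1 day, 0:05:00
print(total_periods)   # 289
```

One day is 288 five-minute periods, and including the end period adds one more, which matches the 289 reported by the model.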
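Here, I believe all of my validation works as intended and raises the expected ValueErrors rather than the less helpful KeyErrors. Is this approach reasonable? It seems to run counter to the typical/recommended approach, and the documentation for @root_validator is quite brief compared to that for @validator.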
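The difference between a KeyError and a ValueError can be illustrated without Pydantic at all. The following is a minimal sketch, assuming the validator receives a plain dict of raw inputs; require_keys is a hypothetical helper introduced only for illustration and is not part of Pydantic or of the model above.

```python
def require_keys(values: dict, *names: str) -> None:
    """Hypothetical helper: report missing inputs as a ValueError."""
    missing = [name for name in names if name not in values]
    if missing:
        raise ValueError(f"Missing required values: {', '.join(missing)}")


# Stand-in for a raw values dict in which 'end' was never supplied
raw = {'start': 1, 'period': 5}

# Direct indexing surfaces the problem as a KeyError
try:
    raw['end']
except KeyError as exc:
    print(f'KeyError: {exc}')

# Checking up front turns the same problem into a ValueError
try:
    require_keys(raw, 'start', 'end', 'period')
except ValueError as exc:
    print(f'ValueError: {exc}')
```

A guard like this at the top of a long validator would presumably let missing inputs be reported in the same way as the other validation failures, though whether that fits the model above is an assumption.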
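I am also unhappy about having to list the derived values (timezone, total_duration, total_periods) at the top of the model. Doing so implies that they can/should be passed at instantiation, and it requires extra logic in my validator to determine whether they were passed and whether they match the derived values. If I omitted them, they would not benefit from the default validation of types, constraints, etc., and I would be forced to change the config to extra='allow'. I would appreciate any suggestions on how to improve this.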
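To make the concern concrete, here is one possible shape for keeping derived values out of the constructor. It is only a sketch and deliberately swaps in a plain frozen dataclass with standard-library datetime types instead of Pydantic and pandas, so it makes no claim about Pydantic's API; TimeseriesSketch is a hypothetical name, and only a subset of the checks above is reproduced.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeseriesSketch:
    """Hypothetical stand-in: derived values are read-only properties."""
    start: datetime
    end: datetime
    period: timedelta
    include_end_period: bool = False

    def __post_init__(self):
        # Input checks mirror a few of those in the validator above
        if self.start > self.end:
            raise ValueError('Start timestamp cannot be later than end')
        if self.period <= timedelta(0):
            raise ValueError('Period must be a positive amount of time')
        if self.total_duration % self.period != timedelta(0):
            raise ValueError('Total duration not divisible by period length')

    @property
    def total_duration(self) -> timedelta:
        # Add one period when the end period is included
        extra = self.period if self.include_end_period else timedelta(0)
        return self.end - self.start + extra

    @property
    def total_periods(self) -> int:
        # Floor division of two timedeltas yields an int
        return self.total_duration // self.period


ts = TimeseriesSketch(
    start=datetime(2023, 3, 1, tzinfo=timezone.utc),
    end=datetime(2023, 3, 2, tzinfo=timezone.utc),
    period=timedelta(minutes=5),
    include_end_period=True,
)
print(ts.total_periods)  # 289
```

Because total_duration and total_periods are properties rather than fields, passing them to the constructor raises a TypeError, so no "was it passed and does it match" logic is needed. The trade-off is that they no longer go through field-level type or constraint validation.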
Thanks!