现在,我有了一个视图,其中有一堆常见的、有条件的字符串替换和对开放文本字段的替换——在本例中是区域分类.
(Please ignore the accuracy of geography, I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even just cleaning the RegEx statements for lookback - I'm just asking about the variable/nesting here.)
CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
RETURNS text
LANGUAGE sql
STRICT
AS $function$
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
,'Rest\sof\sthe\sWorld\s\-\s','')
,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
,'\&\;','&')
,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
,'EUROPE - Turkey','MIDDLE EAST - Turkey')
使用此函数的示例输出在我的数据集中看起来是这样的,它会拉出受影响的记录(有些记录的格式已经正确):
record_region_input | record_region_output |
---|---|
NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro | USA - NORTHEAST - Massachusetts - Boston Metro |
NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia | USA - MIDATLANTIC - Virginia |
Rest of the World - ASIA - Thailand | ASIA - Thailand |
Rest of the World - EUROPE - Portugal | EUROPE - Portugal |
Rest of the World - ASIA - China - Shanghai Metro | ASIA - China - Shanghai Metro |
Georgia - Georgia | MIDDLE EAST - Georgia |
这是...好的正则表达式是必需的,因为在这些字符串之前或之后可能会出现大量的变化,我在其他地方有一个适当的验证列表.这只是一些常见的历史命名问题.
The problem is where I get hundreds of these kind of "known substitutions" (100+)用于公司命名或跨部门标准.有几十个REGEXP_REPLACE(
个嵌套语句,使得编辑/添加/删除任何内容都成为一个疯狂的计数游戏.
我正在try 只清理Postgres中的数据,因为我目前的管道并不总是允许在上传之前进行标准化.我知道如何在纯SQL之外清晰地解决这个问题,但在"普通"PostgreSQL实例(v12+)中,有没有更好的方法来转换视图的字符串?
Updated with a sample input/output table using the example function.