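I have a question about merging two pd.DataFrames based on a string pattern in one column. There are some very helpful discussions on Stack Overflow, and I found an approach that fits my needs well (Merge two dataframe if one string column is contained in another column in Pandas).

This approach works well in my MWE.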

import pandas as pd

# Target-df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

# print(df)
Company                                                    Certificate+Number        Substance
MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN              R1-CEP 2012-025 - Rev 02  Suxamethonium Chloride
SIEGFRIED LTD. Zofingen CH                                 R2-CEP 1996-036 - Rev 02  Amitriptyline hydrochloride
SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN     R0-CEP 2008-165 - Rev 00  Oxytetracycline hydrochloride
CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ  R1-CEP 2002-193 - Rev 00  Ephedrine hydrochloride

Second, I have a huge df with information on cities, countries, country codes, and so on. First, as a minimal example:

world_cities_min = pd.DataFrame({'Geoname ID':[1275339,
                                 '2657915',
                                 '1785286',
                                 '3061344', 
                                 ], 
                                  'City':['Mumbai',
                                          'Zofingen',
                                          'Zibo',
                                          'Zibo',
                                           ],
                                  'ASCII Name':['Mumbai',
                                                'Zofingen',
                                                'Zibo',
                                                'City', 
                                                 ], 
                                  'Country':['India',
                                             'Switzerland',
                                             'China',
                                             'Czech Republic', 
                                            ],
                                  'Alpha2':['IN',
                                            'CH',
                                            'CN',
                                            'CZ', 
                                            ], 
                               })
    
#print(world_cities_min.head(5))
Geoname ID  City      ASCII Name  Country         Alpha2
1275339     Mumbai    Mumbai      India           IN
2657915     Zofingen  Zofingen    Switzerland     CH
1785286     Zibo      Zibo        China           CN
3061344     Zibo      City        Czech Republic  CZ

Build a pattern to find the city names (following the approach from Merge two dataframe if one string column is contained in another column in Pandas):

pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

However, when I use the full world cities df, I get the following error: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

# Once again, the original target-df  
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
                              'SIEGFRIED LTD. Zofingen CH',
                              'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
                              'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ', 
                               ], 
                   'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
                                         'R2-CEP 1996-036 - Rev 02',
                                         'R0-CEP 2008-165 - Rev 00',
                                         'R1-CEP 2002-193 - Rev 00',
                                          ],
                   'Substance':['Suxamethonium Chloride',
                                'Amitriptyline hydrochloride',
                                'Oxytetracycline hydrochloride',
                                'Ephedrine hydrochloride', 
                                 ], 
                       }
                       )

Load the full df:

url = 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/csv?lang=en&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B'


column_names = ['Geoname ID',
                'Name', 
                'ASCII Name',   
                'Alternate Names',
                'Feature Class',
                'Feature Code',
                'Country Code',
                'Country name EN',  
                'Country Code 2'    ,
                'Admin1 Code'   ,
                'Admin2 Code'   ,
                'Admin3 Code',  
                'Admin4 Code',  
                'Population',
                'Elevation',    
                'DIgital Elevation Model',  
                'Timezone', 
                'Modification date',    
                'LABEL EN', 
                'Coordinates'
                 ]
    
world_cities = pd.read_csv(url,
                           header=0,  # replace the file's own header row with column_names
                           sep=';',
                           names=column_names,
                           usecols=['Name',
                                    'ASCII Name',
                                    'Country Code',
                                    'Country name EN',
                                    'Coordinates'],
                           )

...and do the same thing, this time with the full world_cities:

pat = '|'.join(r"\b{}\b".format(x) for x in world_cities['ASCII Name'])

# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
    
#print(df)

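which leads to: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name

Could you help me troubleshoot this? Where in the full df is the problem, and how can I handle it? My overall goal is to have the city, country name, and Alpha2 code as separate columns. Unfortunately, that information sits inside df['Company'] without a unique string pattern.

Thank you very much for your advice.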

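Recommended answer

Cause of the error

The larger world_cities DataFrame contains city names with characters that have a special meaning in regular expressions. For example, some names contain parentheses (), which denote a capturing group. With extra groups in the pattern, str.extract returns a DataFrame with one column per group instead of a Series, which is why assigning the result to a single column fails. Here is a sample from world_cities: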

                  Name      ASCII Name Country Code Country name EN         Coordinates
84328      Hamm (Sieg)     Hamm (Sieg)           DE         Germany   50.76531, 7.67761
63174    Obolo-Eke (1)   Obolo-Eke (1)           NG         Nigeria    6.88333, 7.63333
50291    Halle (Saale)   Halle (Saale)           DE         Germany  51.48158, 11.97947
126292  Seen (Kreis 3)  Seen (Kreis 3)           CH     Switzerland   47.47646, 8.76996
131692  Schwedt (Oder)  Schwedt (Oder)           DE         Germany  53.05963, 14.28154
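
To make the cause concrete, here is a minimal sketch that reproduces the behaviour on a tiny, made-up input. The `companies` Series and the `names` list are hypothetical stand-ins for df['Company'] and world_cities['ASCII Name']. Only the difference between the unescaped and the escaped pattern matters.

```python
import re
import pandas as pd

# Hypothetical stand-ins for df['Company'] and world_cities['ASCII Name']
companies = pd.Series(["X LTD. Mumbai IN"])
names = ["Halle (Saale)", "Mumbai"]

# Unescaped: the parentheses in "Halle (Saale)" add a second capturing group
unescaped = "(" + "|".join(r"\b{}\b".format(n) for n in names) + ")"
n_groups_unescaped = re.compile(unescaped).groups
raw = companies.str.extract(unescaped, expand=False)
print(n_groups_unescaped, type(raw).__name__)

# Escaped: the parentheses are matched literally, so only one group remains
escaped = "(" + "|".join(r"\b{}\b".format(re.escape(n)) for n in names) + ")"
n_groups_escaped = re.compile(escaped).groups
clean = companies.str.extract(escaped, expand=False)
print(n_groups_escaped, type(clean).__name__)
```

With more than one capturing group, str.extract returns a DataFrame even when expand=False, and a DataFrame cannot be assigned to a single column. Note also that a trailing \b after an escaped ")" only matches when a word character follows, so names that end in a parenthesis are unlikely to be found in the company strings. This sketch does not address that.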

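Solution

Escape the regex metacharacters in each city name with re.escape before building the pattern:

import re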

# ensure null values are dropped
cities = world_cities['ASCII Name'].dropna()

# Escape the special regex reserved characters in city names
pat = r'\b(%s)\b' % '|'.join(map(re.escape, cities))

# extract the matching occurrences of the regex pattern
df['ASCII name'] = df['Company'].str.extract(pat, expand=False)

Result

                                                     Company        Certificate+Number                      Substance ASCII name
0              MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN  R1-CEP 2012-025 - Rev 02         Suxamethonium Chloride     Mumbai
1                                 SIEGFRIED LTD. Zofingen CH  R2-CEP 1996-036 - Rev 02    Amitriptyline hydrochloride   Zofingen
2     SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN  R0-CEP 2008-165 - Rev 00  Oxytetracycline hydrochloride       Zibo
3  CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ  R1-CEP 2002-193 - Rev 00        Ephedrine hydrochloride       Zibo
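
For the overall goal of getting city, country and Alpha2 code as separate columns, one possible follow-up is a left merge against the city table. The sketch below is only an illustration under assumptions that are not confirmed by the question: every Company string ends with a two-letter country code, and the city table has the columns of the minimal example (`ASCII Name`, `Country`, `Alpha2`). The small `cities` frame is a hypothetical stand-in for the full table.

```python
import re
import pandas as pd

df = pd.DataFrame({"Company": [
    "MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN",
    "SIEGFRIED LTD. Zofingen CH",
    "CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ",
]})

# Hypothetical stand-in for the full city table
cities = pd.DataFrame({
    "ASCII Name": ["Mumbai", "Zofingen", "Zibo", "Zibo"],
    "Country": ["India", "Switzerland", "China", "Czech Republic"],
    "Alpha2": ["IN", "CH", "CN", "CZ"],
})

# Escape the names and drop nulls/duplicates before building the pattern
names = cities["ASCII Name"].dropna().drop_duplicates()
pat = r"\b(%s)\b" % "|".join(map(re.escape, names))
df["ASCII Name"] = df["Company"].str.extract(pat, expand=False)

# Assumption: the last token of Company is a two-letter country code
df["Alpha2"] = df["Company"].str.extract(r"\b([A-Z]{2})\s*$", expand=False)

# Merging on both keys keeps a name like "Zibo" from matching two countries
merged = df.merge(cities, how="left", on=["ASCII Name", "Alpha2"])
print(merged[["ASCII Name", "Country", "Alpha2"]])
```

Merging on the city name alone would duplicate rows whenever the same name exists in several countries, which is why the country code is used as a second key here. If the Company strings do not reliably end with a country code, this second key is not available and the ambiguity has to be resolved some other way.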
