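I have a question about merging two pd.DataFrames based on a string pattern in one column. There are some very helpful discussions on Stack Overflow, and I found an approach that fits my needs very well (Merge two dataframe if one string column is contained in another column in Pandas).
This approach works very well in my MWE.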
import pandas as pd

# Target-df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
'SIEGFRIED LTD. Zofingen CH',
'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
],
'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
'R2-CEP 1996-036 - Rev 02',
'R0-CEP 2008-165 - Rev 00',
'R1-CEP 2002-193 - Rev 00',
],
'Substance':['Suxamethonium Chloride',
'Amitriptyline hydrochloride',
'Oxytetracycline hydrochloride',
'Ephedrine hydrochloride',
],
}
)
# print(df)
Company | Certificate+Number | Substance |
---|---|---|
MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN | R1-CEP 2012-025 - Rev 02 | Suxamethonium Chloride |
SIEGFRIED LTD. Zofingen CH | R2-CEP 1996-036 - Rev 02 | Amitriptyline hydrochloride |
SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN | R0-CEP 2008-165 - Rev 00 | Oxytetracycline hydrochloride |
CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ | R1-CEP 2002-193 - Rev 00 | Ephedrine hydrochloride |
Second, I have a huge df with information on cities, countries, country codes, etc. First, as a minimal example:
world_cities_min = pd.DataFrame({'Geoname ID':[1275339,
2657915,
1785286,
3061344,
],
'City':['Mumbai',
'Zofingen',
'Zibo',
'Zibo',
],
'ASCII Name':['Mumbai',
'Zofingen',
'Zibo',
'City',
],
'Country':['India',
'Switzerland',
'China',
'Czech Republic',
],
'Alpha2':['IN',
'CH',
'CN',
'CZ',
],
})
#print(world_cities_min.head(5))
Geoname ID | City | ASCII Name | Country | Alpha2 |
---|---|---|---|---|
1275339 | Mumbai | Mumbai | India | IN |
2657915 | Zofingen | Zofingen | Switzerland | CH |
1785286 | Zibo | Zibo | China | CN |
3061344 | Zibo | City | Czech Republic | CZ |
Build the pattern to find the city name (following the approach from Merge two dataframe if one string column is contained in another column in Pandas):
pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])
# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
#print(df)
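For context, here is a minimal sketch of how the extracted column could then be used for the actual merge in the MWE. The left join on 'ASCII Name' and the name `merged` are assumptions for illustration, and the frames are trimmed to the columns needed here.

```python
import pandas as pd

# Trimmed copies of the MWE frames (only the columns needed for the merge)
df = pd.DataFrame({'Company': [
    'MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
    'SIEGFRIED LTD. Zofingen CH',
    'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
    'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
]})
world_cities_min = pd.DataFrame({
    'City': ['Mumbai', 'Zofingen', 'Zibo', 'Zibo'],
    'ASCII Name': ['Mumbai', 'Zofingen', 'Zibo', 'City'],
    'Country': ['India', 'Switzerland', 'China', 'Czech Republic'],
    'Alpha2': ['IN', 'CH', 'CN', 'CZ'],
})

# Same pattern construction as in the MWE
pat = '|'.join(r"\b{}\b".format(x) for x in world_cities_min['ASCII Name'])
df['ASCII Name'] = df['Company'].str.extract('(' + pat + ')', expand=False)

# Assumed merge step: left join so every company row is kept
merged = df.merge(world_cities_min, on='ASCII Name', how='left')
```

Because the match is on the city name alone, both "Zibo" rows resolve to the same entry of world_cities_min in this sketch, so the name by itself does not distinguish the CN row from the CZ row.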
However, when I use the full world-cities df, I get the following error: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name
# Once again, the original target-df
df = pd.DataFrame({'Company':['MAC CHEM PRODUCTS (INDIA) PVT. LTD. Mumbai IN',
'SIEGFRIED LTD. Zofingen CH',
'SHANDONG JINYANG PHARMACEUTICAL CO., LTD. Zibo City CN',
'CHIFENG ARKER PHARMACEUTICAL TECHNOLOGY CO., LTD. Zibo CZ',
],
'Certificate+Number':['R1-CEP 2012-025 - Rev 02',
'R2-CEP 1996-036 - Rev 02',
'R0-CEP 2008-165 - Rev 00',
'R1-CEP 2002-193 - Rev 00',
],
'Substance':['Suxamethonium Chloride',
'Amitriptyline hydrochloride',
'Oxytetracycline hydrochloride',
'Ephedrine hydrochloride',
],
}
)
Load the full df:
url = 'https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/geonames-all-cities-with-a-population-1000/exports/csv?lang=en&timezone=Europe%2FBerlin&use_labels=true&delimiter=%3B'
column_names = ['Geoname ID',
'Name',
'ASCII Name',
'Alternate Names',
'Feature Class',
'Feature Code',
'Country Code',
'Country name EN',
'Country Code 2' ,
'Admin1 Code' ,
'Admin2 Code' ,
'Admin3 Code',
'Admin4 Code',
'Population',
'Elevation',
'DIgital Elevation Model',
'Timezone',
'Modification date',
'LABEL EN',
'Coordinates'
]
world_cities = pd.read_csv(url,
header=0,  # replace the file's own header row with column_names
sep=';',
names=column_names,
usecols = [
'Name',
'ASCII Name',
'Country Code' ,
'Country name EN',
'Coordinates'],
)
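As a side note on the loading step, here is a small sketch of how `header` interacts with `names` in `pd.read_csv`. The in-memory CSV text is a made-up stand-in for the real file, used only so the example runs without a download.

```python
import io
import pandas as pd

# Made-up stand-in for the real CSV: one header line, two data rows
csv_text = "Name;Country Code\nMumbai;IN\nZofingen;CH\n"
cols = ['Name', 'Country Code']

# header=0: the file's own header line is replaced by `names`
kept = pd.read_csv(io.StringIO(csv_text), sep=';', header=0, names=cols)

# header=1: the first data row is treated as the header line and discarded
dropped = pd.read_csv(io.StringIO(csv_text), sep=';', header=1, names=cols)

print(len(kept), len(dropped))
```

With `names` supplied, `header=0` keeps every data row, whereas `header=1` silently loses the first one.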
...and do the same thing, this time with the full df:
pat = '|'.join(r"\b{}\b".format(x) for x in world_cities['ASCII Name'])
# and create column in target-df according to the name of the city
df['ASCII Name']= df['Company'].str.extract('('+ pat + ')', expand=False)
#print(df)
which leads to: ValueError: Cannot set a DataFrame with multiple columns to the single column ASCII Name
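One possible explanation, stated as an assumption because it has not been verified against the full dataset: if some city names contain regex metacharacters such as parentheses, they add extra capture groups to the pattern, and `str.extract` then returns a DataFrame with several columns even with `expand=False`. The sketch below reproduces that effect with made-up names ('Seen (Kreis 3)' is purely illustrative) and shows `re.escape` as one way to avoid it. The `dropna()` call is an additional precaution, since missing names would otherwise be formatted as the string "nan".

```python
import re
import pandas as pd

companies = pd.Series(['SIEGFRIED LTD. Zofingen CH', 'ACME LTD. Mumbai IN'])

# Illustrative city names; the second one contains regex metacharacters
names = pd.Series(['Zofingen', 'Seen (Kreis 3)', 'Mumbai'])

# Unescaped: the parentheses in the name create a second capture group
raw_pat = '|'.join(r"\b{}\b".format(x) for x in names)
raw = companies.str.extract('(' + raw_pat + ')', expand=False)

# Escaped: every name is matched literally, so only the outer group remains
safe_pat = '|'.join(r"\b{}\b".format(re.escape(x))
                    for x in names.dropna().astype(str))
safe = companies.str.extract('(' + safe_pat + ')', expand=False)

print(type(raw).__name__, raw.shape)
print(type(safe).__name__)
```

Assigning `raw` to a single column would fail in the way described above, while `safe` is a plain Series that can be assigned directly.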
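Could you please help me troubleshoot this? Where is the problem in the full df, and how can I handle it? My overall goal is to get City, Country Name and the Alpha2 code as separate columns. Unfortunately, the information in df['Company'] does not follow a unique string pattern.
Thank you very much for your advice.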