I have a regex that looks like this to extract order numbers from columns:

df["Orders"].str.extract('([0-9]{9,10}[/+ #_;.-]?)')

The orders column can look like this:

12
123456789
1234567890
123456789/1234567890
123456789/1/123456789
123456789+1234567890

The resulting new column in the dataframe after the regex should look like this:

NaN
123456789
1234567890
123456789/1234567890
123456789/123456789
123456789+1234567890

However, with my current regex I'm getting the following result:

NaN
123456789
1234567890
123456789/
123456789/
123456789+

How can I get the result that I'm looking for?

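Recommended answer

Series.str.extract only returns the first match in each string, which is why everything after the first number is lost. To collect every match, you can use Series.str.findall and join the results: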

import numpy as np  # needed for np.nan below
import pandas as pd
df = pd.DataFrame({'Orders':['12','123456789','1234567890','123456789/1234567890','123456789/1/123456789','123456789+1234567890', 'Order number: 6508955960_000010_1005500']})
df["Result"] = df["Orders"].str.findall(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])').str.join('').str.lstrip('/+ #_;.-')
df.loc[df['Result'] == '', 'Result'] = np.nan

See the regex demo. Details:

  • [/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9]) - matches an optional /, +, space, #, _, ;, . or - char, and then a nine- or ten-digit number that is not adjacent to any other digit
  • Series.str.findall extracts all occurrences
  • .str.join('') concatenates the matches into a single string
  • .str.lstrip('/+ #_;.-') - removes the special chars that were matched with the number at the beginning of the string
  • df.loc[df['Result'] == '', 'Result'] = np.nan - if needed - replaces empty strings with np.nan values in the Result column.
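To see what each step contributes, the chain can be mirrored on a single string with the standard re module. This is only a sketch: extract_orders and the two constants are hypothetical names introduced for illustration, and it relies on Series.str.findall behaving like re.findall applied to each element.

```python
import re

# Same pattern as in the answer: an optional separator, then a run of
# 9 or 10 digits that has no digit directly before or after it.
PATTERN = re.compile(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])')
SEPARATORS = '/+ #_;.-'


def extract_orders(text):
    """Hypothetical helper: findall -> join -> lstrip on one string."""
    matches = PATTERN.findall(text)          # every match, separators included
    joined = ''.join(matches)                # glue the matches back together
    cleaned = joined.lstrip(SEPARATORS)      # drop a leading separator, if any
    return cleaned or None                   # None stands in for NaN here


# The short "1" between the slashes is skipped because it is not 9-10 digits.
print(PATTERN.findall('123456789/1/123456789'))  # ['123456789', '/123456789']
print(extract_orders('123456789/1/123456789'))   # 123456789/123456789
print(extract_orders('12'))                      # None
```

The separator is matched in front of each number rather than behind it, so a separator is only kept when a valid number follows it.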

Output:

>>> df
                  Orders                Result
0                     12                   NaN
1              123456789             123456789
2             1234567890            1234567890
3   123456789/1234567890  123456789/1234567890
4  123456789/1/123456789   123456789/123456789
5   123456789+1234567890  123456789+1234567890
>>> 
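The sample frame in the answer also contains a seventh value, 'Order number: 6508955960_000010_1005500', which the printed output does not show. A quick check with the standard re module (assuming, as above, that Series.str.findall works like re.findall per element) suggests how that row would come out; the variable names are illustrative only.

```python
import re

# Pattern and separator set copied from the answer.
pattern = re.compile(r'[/+ #_;.-]?(?<![0-9])[0-9]{9,10}(?![0-9])')
separators = '/+ #_;.-'

text = 'Order number: 6508955960_000010_1005500'

# Only the 10-digit run qualifies; "000010" and "1005500" are too short.
matches = pattern.findall(text)
print(matches)                                # [' 6508955960']

# Joining and stripping the leading space leaves just the order number.
result = ''.join(matches).lstrip(separators)
print(result)                                 # 6508955960
```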
