I have this file.

If I read it and do some cleaning with a regular expression:

import pandas as pd
import re

df = pd.read_csv('data.csv', index_col=[0])
out = df[['X', 'Y']].apply(lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)',
                                             expand=False, flags=re.I)
                                             .str.replace(r'[^\d.]+', '', regex=True))

out.index+=1
out

I get the following result:

      X     Y
2   NaN     NaN
3   NaN     NaN
4   573456.81   3887265.85
5   573453.26   NaN
6   573450.98   NaN
7   NaN     NaN
8   NaN     NaN
9   573445.167  3887284.597
10  NaN     NaN
11  NaN     NaN
12  573703.6758759461   3887233.5301764384
....

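It contains NaN values instead of the cleaned data.

The strange thing is that the cleaning itself works, because if I just copy the contents of the dataframe I loaded into a list: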

thelist = [['1', '573436.862', '3887259.269'],
 ['2', '573436.031', '3887248.472'],
 ['3', '573456.81', '3887265.85'],
 ['4', '573453.26', '3887273.017'],
 ['5', '573450.98', '3887275.878'],
 ['6', '573451.611', '3887276.346'],
 ['7', '573446.959', '3887285.738'],
 ['8', 'H5m7er3o4m45h.n1i6a7 print: 19/02/202', '4 15:24 3887284.597'],
 ['9', '573440.184', '3887292.487'],
 ['10', '573436.862', '3887259.269'],
 ['1', '573703.6758759461', '3887233.5301764384'],
 ['2', '573703.7165950707', '3887233.6523487056'],
 ['3', '573703.769', '3887233.809'],
 ['4', '573707.305', '3887241.398'],
 ['5', '573712.9489897437', '3887251.2139821625'],
 ['6', '57mro3m71hn2ia.949print: 19/02/2024', '15:22 3887251.2139999997'],
 ['7', '573712.981495283', '3887251.2813396226'],
 ['8', '573713H.0m3e2romhnia print: 19/0', '2/2024 15:24 3887251.386'],
 ['9', '573713.096', '3887251.567'],
 ['10', '573713.0960000466', '3887251.5670001707'],
 ['11', '573713.266443923', '3887252.1920684506'],
 ['12', '573725.815', '3887254.127'],
 ['13', '573733.267', '3887255.275'],
 ['14', '573736.197', '3887240.846'],
 ['15', '573742.399', '3887229.682'],
 ['16', '573701.647', '3887220.061'],
 ['17', '573703.6758759461', '3887233.5301764384']]

and run the cleaning:

import numpy as np

arr = np.hstack(thelist)
arr = arr.reshape(arr.shape[0] // 3, 3)
new_df = pd.DataFrame(arr, columns=['K', 'X', 'Y'])

out = new_df[['X', 'Y']].apply(lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)', 
                                               expand=False, flags=re.I)
                                               .str.replace(r'[^\d.]+', '', regex=True))

I get the correct result!

    X   Y
0   573436.862  3887259.269
1   573436.031  3887248.472
2   573456.81   3887265.85
3   573453.26   3887273.017
4   573450.98   3887275.878
5   573451.611  3887276.346
6   573446.959  3887285.738
7   573445.167  3887284.597
8   573440.184  3887292.487
9   573436.862  3887259.269
10  573703.6758759461   3887233.5301764384
11  573703.7165950707   3887233.6523487056
12  573703.769  3887233.809
13  573707.305  3887241.398
....

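Recommended answer

It is expected that the two approaches give different output, because you are using different inputs. In the example below, we can see that some rows in the CSV (unlike the list) do not contain a dot (.), while your regex pattern requires a \. to match. That is why the first approach returns NaN values for those rows.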

0,1,573436862,3887259269           # << first record of the .csv

['1', '573436.862', '3887259.269'] # << first record of the list
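The difference can be reproduced on a handful of values without the file. The following is a minimal sketch that uses sample strings taken from the question; the names `PATTERN`, `samples`, `cleaned` and `has_dot` are illustrative and do not appear in the original code.

```python
import re
import pandas as pd

# Same pattern as in the question: it only matches when a literal dot is present.
PATTERN = r'([a-z\d]+\.[a-z\d]+)'

samples = pd.Series([
    '573436862',                              # CSV-style value, no dot
    '573436.862',                             # list-style value, with a dot
    'H5m7er3o4m45h.n1i6a7 print: 19/02/202',  # noisy value from the list
])

# Step 1: extract the first "token.token" match (NaN when nothing matches).
extracted = samples.str.extract(PATTERN, expand=False, flags=re.I)

# Step 2: drop everything that is neither a digit nor a dot.
cleaned = extracted.str.replace(r'[^\d.]+', '', regex=True)

# Diagnostic: which input values contain a literal dot at all?
has_dot = samples.str.contains('.', regex=False)

print(pd.DataFrame({'input': samples, 'has_dot': has_dot, 'cleaned': cleaned}))
```

The value without a dot comes back as NaN, while the other two are cleaned as expected. Running the same `str.contains('.', regex=False)` check on the X and Y columns of the loaded CSV should show which rows the pattern cannot match.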
