Here is a problem I ran into:

I want to plot some energy bands obtained with Quantum ESPRESSO. The data are in a two-column file, and the rows are separated into blocks by blank lines. Each block corresponds to one band.

Here is a sample of the first two blocks:

    0.0000  -44.2709
    0.0250  -44.2709
    0.0500  -44.2709
    0.0750  -44.2708
    0.1000  -44.2708
    0.1250  -44.2707
    0.1500  -44.2706
    0.1750  -44.2705
    0.2000  -44.2703
    0.2250  -44.2702
    0.2500  -44.2701
    0.2750  -44.2700
    0.3000  -44.2698
    0.3250  -44.2697
    0.3500  -44.2696
    0.3750  -44.2695
    0.4000  -44.2694
    0.4250  -44.2694
    0.4500  -44.2693
    0.4750  -44.2693
    0.5000  -44.2693
    0.5250  -44.2693
    0.5500  -44.2692
    0.5750  -44.2692
    0.6000  -44.2691
    0.6250  -44.2690
    0.6500  -44.2689
    0.6750  -44.2688
    0.7000  -44.2687
    0.7250  -44.2686
    0.7500  -44.2685
    0.7750  -44.2683
    0.8000  -44.2682
    0.8250  -44.2681
    0.8500  -44.2680
    0.8750  -44.2679
    0.9000  -44.2678
    0.9250  -44.2678
    0.9500  -44.2677
    0.9750  -44.2677
    1.0000  -44.2677
    1.0354  -44.2677
    1.0707  -44.2677
    1.1061  -44.2678
    1.1414  -44.2680
    1.1768  -44.2681
    1.2121  -44.2683
    1.2475  -44.2686
    1.2828  -44.2688
    1.3182  -44.2690
    1.3536  -44.2693
    1.3889  -44.2695
    1.4243  -44.2698
    1.4596  -44.2700
    1.4950  -44.2702
    1.5303  -44.2704
    1.5657  -44.2706
    1.6010  -44.2707
    1.6364  -44.2708
    1.6718  -44.2709
    1.7071  -44.2709
    1.7504  -44.2709
    1.7937  -44.2708
    1.8370  -44.2706
    1.8803  -44.2704
    1.9236  -44.2702
    1.9669  -44.2699
    2.0102  -44.2696
    2.0535  -44.2692
    2.0968  -44.2689
    2.1401  -44.2685
    2.1834  -44.2681
    2.2267  -44.2677
    2.2700  -44.2674
    2.3133  -44.2671
    2.3566  -44.2668
    2.3999  -44.2665
    2.4432  -44.2663
    2.4865  -44.2662
    2.5298  -44.2661
    2.5731  -44.2661
    2.6085  -44.2661
    2.6438  -44.2661
    2.6792  -44.2662
    2.7146  -44.2664
    2.7499  -44.2665
    2.7853  -44.2667
    2.8206  -44.2669
    2.8560  -44.2672
    2.8913  -44.2674
    2.9267  -44.2677
    2.9620  -44.2679
    2.9974  -44.2682
    3.0328  -44.2684
    3.0681  -44.2686
    3.1035  -44.2688
    3.1388  -44.2690
    3.1742  -44.2691
    3.2095  -44.2692
    3.2449  -44.2693
    3.2802  -44.2693
    3.2802  -44.2677
    3.3052  -44.2677
    3.3302  -44.2676
    3.3552  -44.2676
    3.3802  -44.2675
    3.4052  -44.2674
    3.4302  -44.2673
    3.4552  -44.2672
    3.4802  -44.2671
    3.5052  -44.2670
    3.5302  -44.2669
    3.5552  -44.2667
    3.5802  -44.2666
    3.6052  -44.2665
    3.6302  -44.2664
    3.6552  -44.2663
    3.6802  -44.2662
    3.7052  -44.2662
    3.7302  -44.2661
    3.7552  -44.2661
    3.7802  -44.2661
 
    0.0000  -20.8317
    0.0250  -20.8322
    0.0500  -20.8338
    0.0750  -20.8364
    0.1000  -20.8400
    0.1250  -20.8445
    0.1500  -20.8497
    0.1750  -20.8555
    0.2000  -20.8618
    0.2250  -20.8684
    0.2500  -20.8751
    0.2750  -20.8819
    0.3000  -20.8884
    0.3250  -20.8947
    0.3500  -20.9004
    0.3750  -20.9055
    0.4000  -20.9098
    0.4250  -20.9133
    0.4500  -20.9159
    0.4750  -20.9174
    0.5000  -20.9179
    0.5250  -20.9179
    0.5500  -20.9178
    0.5750  -20.9175
    0.6000  -20.9172
    0.6250  -20.9169
    0.6500  -20.9164
    0.6750  -20.9159
    0.7000  -20.9154
    0.7250  -20.9149
    0.7500  -20.9143
    0.7750  -20.9137
    0.8000  -20.9132
    0.8250  -20.9126
    0.8500  -20.9122
    0.8750  -20.9117
    0.9000  -20.9113
    0.9250  -20.9110
    0.9500  -20.9108
    0.9750  -20.9107
    1.0000  -20.9106
    1.0354  -20.9102
    1.0707  -20.9089
    1.1061  -20.9068
    1.1414  -20.9039
    1.1768  -20.9003
    1.2121  -20.8959
    1.2475  -20.8910
    1.2828  -20.8855
    1.3182  -20.8797
    1.3536  -20.8736
    1.3889  -20.8673
    1.4243  -20.8611
    1.4596  -20.8551
    1.4950  -20.8495
    1.5303  -20.8444
    1.5657  -20.8400
    1.6010  -20.8365
    1.6364  -20.8338
    1.6718  -20.8322
    1.7071  -20.8317
    1.7504  -20.8322
    1.7937  -20.8338
    1.8370  -20.8365
    1.8803  -20.8400
    1.9236  -20.8443
    1.9669  -20.8492
    2.0102  -20.8545
    2.0535  -20.8601
    2.0968  -20.8659
    2.1401  -20.8716
    2.1834  -20.8772
    2.2267  -20.8826
    2.2700  -20.8876
    2.3133  -20.8922
    2.3566  -20.8962
    2.3999  -20.8997
    2.4432  -20.9025
    2.4865  -20.9045
    2.5298  -20.9058
    2.5731  -20.9062
    2.6085  -20.9063
    2.6438  -20.9064
    2.6792  -20.9067
    2.7146  -20.9071
    2.7499  -20.9076
    2.7853  -20.9082
    2.8206  -20.9089
    2.8560  -20.9096
    2.8913  -20.9105
    2.9267  -20.9114
    2.9620  -20.9123
    2.9974  -20.9132
    3.0328  -20.9142
    3.0681  -20.9151
    3.1035  -20.9159
    3.1388  -20.9166
    3.1742  -20.9171
    3.2095  -20.9176
    3.2449  -20.9178
    3.2802  -20.9179
    3.2802  -20.9106
    3.3052  -20.9106
    3.3302  -20.9105
    3.3552  -20.9104
    3.3802  -20.9102
    3.4052  -20.9100
    3.4302  -20.9097
    3.4552  -20.9094
    3.4802  -20.9091
    3.5052  -20.9088
    3.5302  -20.9084
    3.5552  -20.9081
    3.5802  -20.9078
    3.6052  -20.9074
    3.6302  -20.9071
    3.6552  -20.9069
    3.6802  -20.9066
    3.7052  -20.9065
    3.7302  -20.9063
    3.7552  -20.9063
    3.7802  -20.9062

As you may notice, the first column repeats the same data in every block; only the second column differs. What I want to do is keep the first column of the first block only, and turn each block's second column into a separate column, like this:

    0.0000  -44.2709   -20.8317
    0.0250  -44.2709   -20.8322
    0.0500  -44.2709   -20.8338
    0.0750  -44.2708   -20.8364
    0.1000  -44.2708   -20.8400
    0.1250  -44.2707   -20.8445
    0.1500  -44.2706   -20.8497
    0.1750  -44.2705   -20.8555
    0.2000  -44.2703   -20.8618
    0.2250  -44.2702   -20.8684
    0.2500  -44.2701   -20.8751
    0.2750  -44.2700   -20.8819
    0.3000  -44.2698   -20.8884
    0.3250  -44.2697   -20.8947
    0.3500  -44.2696   -20.9004
    0.3750  -44.2695   -20.9055
    0.4000  -44.2694   -20.9098
    0.4250  -44.2694   -20.9133
    0.4500  -44.2693   -20.9159
    0.4750  -44.2693   -20.9174
    0.5000  -44.2693   -20.9179
    0.5250  -44.2693   -20.9179
    0.5500  -44.2692   -20.9178
    0.5750  -44.2692   -20.9175
    0.6000  -44.2691   -20.9172
    0.6250  -44.2690   -20.9169
    0.6500  -44.2689   -20.9164
    0.6750  -44.2688   -20.9159
    0.7000  -44.2687   -20.9154
    0.7250  -44.2686   -20.9149
    0.7500  -44.2685   -20.9143
    0.7750  -44.2683   -20.9137
    0.8000  -44.2682   -20.9132
    0.8250  -44.2681   -20.9126
    0.8500  -44.2680   -20.9122
    0.8750  -44.2679   -20.9117
    0.9000  -44.2678   -20.9113
    0.9250  -44.2678   -20.9110
    0.9500  -44.2677   -20.9108
    0.9750  -44.2677   -20.9107
    1.0000  -44.2677   -20.9106
    1.0354  -44.2677   -20.9102
    1.0707  -44.2677   -20.9089
    1.1061  -44.2678   -20.9068
    1.1414  -44.2680   -20.9039
    1.1768  -44.2681   -20.9003
    1.2121  -44.2683   -20.8959
    1.2475  -44.2686   -20.8910
    1.2828  -44.2688   -20.8855
    1.3182  -44.2690   -20.8797
    1.3536  -44.2693   -20.8736
    1.3889  -44.2695   -20.8673
    1.4243  -44.2698   -20.8611
    1.4596  -44.2700   -20.8551
    1.4950  -44.2702   -20.8495
    1.5303  -44.2704   -20.8444
    1.5657  -44.2706   -20.8400
    1.6010  -44.2707   -20.8365
    1.6364  -44.2708   -20.8338
    1.6718  -44.2709   -20.8322
    1.7071  -44.2709   -20.8317
    1.7504  -44.2709   -20.8322
    1.7937  -44.2708   -20.8338
    1.8370  -44.2706   -20.8365
    1.8803  -44.2704   -20.8400
    1.9236  -44.2702   -20.8443
    1.9669  -44.2699   -20.8492
    2.0102  -44.2696   -20.8545
    2.0535  -44.2692   -20.8601
    2.0968  -44.2689   -20.8659
    2.1401  -44.2685   -20.8716
    2.1834  -44.2681   -20.8772
    2.2267  -44.2677   -20.8826
    2.2700  -44.2674   -20.8876
    2.3133  -44.2671   -20.8922
    2.3566  -44.2668   -20.8962
    2.3999  -44.2665   -20.8997
    2.4432  -44.2663   -20.9025
    2.4865  -44.2662   -20.9045
    2.5298  -44.2661   -20.9058
    2.5731  -44.2661   -20.9062
    2.6085  -44.2661   -20.9063
    2.6438  -44.2661   -20.9064
    2.6792  -44.2662   -20.9067
    2.7146  -44.2664   -20.9071
    2.7499  -44.2665   -20.9076
    2.7853  -44.2667   -20.9082
    2.8206  -44.2669   -20.9089
    2.8560  -44.2672   -20.9096
    2.8913  -44.2674   -20.9105
    2.9267  -44.2677   -20.9114
    2.9620  -44.2679   -20.9123
    2.9974  -44.2682   -20.9132
    3.0328  -44.2684   -20.9142
    3.0681  -44.2686   -20.9151
    3.1035  -44.2688   -20.9159
    3.1388  -44.2690   -20.9166
    3.1742  -44.2691   -20.9171
    3.2095  -44.2692   -20.9176
    3.2449  -44.2693   -20.9178
    3.2802  -44.2693   -20.9179
    3.2802  -44.2677   -20.9106
    3.3052  -44.2677   -20.9106
    3.3302  -44.2676   -20.9105
    3.3552  -44.2676   -20.9104
    3.3802  -44.2675   -20.9102
    3.4052  -44.2674   -20.9100
    3.4302  -44.2673   -20.9097
    3.4552  -44.2672   -20.9094
    3.4802  -44.2671   -20.9091
    3.5052  -44.2670   -20.9088
    3.5302  -44.2669   -20.9084
    3.5552  -44.2667   -20.9081
    3.5802  -44.2666   -20.9078
    3.6052  -44.2665   -20.9074
    3.6302  -44.2664   -20.9071
    3.6552  -44.2663   -20.9069
    3.6802  -44.2662   -20.9066
    3.7052  -44.2662   -20.9065
    3.7302  -44.2661   -20.9063
    3.7552  -44.2661   -20.9063
    3.7802  -44.2661   -20.9062

But there is a catch! I managed to get something close with numpy.unique, but I noticed that, for some reason, Quantum ESPRESSO sometimes writes two or more equal values in the first column of a block while the corresponding values in the second column differ, so with numpy.unique I lose data.

I also tried this approach: kp_bands = np.take(bands[:, 0], range(0, 122), axis=0), where bands is the array I load with numpy.loadtxt and 122 is the number of values in each block. The problem is that this number is not fixed: it varies depending on the system under study.

My question is:

How can I do this without losing data and without knowing in advance how many rows each block has?

Answer

The specific case with NumPy

Suppose we know that the data are organized as several vertically stacked blocks of two columns each, that the first column is sorted within each block (say, in non-decreasing order), and that all blocks share the same first column. Then we can use numpy.diff to find the number of blocks and numpy.split to separate them:

import numpy as np

bands = np.loadtxt(file_name)
number_of_bands = 1 + np.sum(np.diff(bands[:,0]) < 0)   # or 1 + sum(bands[1:, 0] < bands[:-1, 0])
values = np.split(bands[:,1], number_of_bands)
band_len = values[0].shape[0]
index = bands[:band_len, 0]

result = np.array([index, *values]).T   # F_CONTIGUOUS
# or
result = np.c_[index, *values]          # C_CONTIGUOUS
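As a self-contained sketch of this approach, here it is run on a tiny synthetic two-band array (the band values below are made up, not the Quantum ESPRESSO output above):

```python
import numpy as np

# Synthetic stand-in for the band file: the first column restarts at each block.
k = np.array([0.0, 0.1, 0.2, 0.3])
bands = np.column_stack([
    np.concatenate([k, k]),
    np.array([-44.1, -44.2, -44.3, -44.4, -20.1, -20.2, -20.3, -20.4]),
])

# A new block begins wherever the first column decreases.
number_of_bands = 1 + np.sum(np.diff(bands[:, 0]) < 0)
values = np.split(bands[:, 1], number_of_bands)
band_len = values[0].shape[0]
index = bands[:band_len, 0]

# One k column followed by one column per band.
result = np.column_stack([index, *values])
print(result.shape)  # (4, 3)
```

`np.column_stack` is used here instead of `np.c_[index, *values]` only because star-unpacking inside a subscript requires Python 3.11+; the two forms are equivalent.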

The general case with Pandas

Suppose we want to handle data with the following pattern:

data = '''\
    0.0000  -44.2709
    0.0500  -44.2709
    0.0750  -44.2701
    0.0750  -44.2702
    0.1000  -44.2708
    0.1250  -44.2707
    0.1500  -44.2706
    0.0750  -44.2703

    0.0000  -20.8317
    0.0250  -20.8322
    0.0500  -20.8338
    0.0750  -20.8364
    0.0750  -20.8365
    0.1000  -20.8400
    0.1000  -20.8401
    0.1250  -20.8445
'''

We know that there are vertically stacked blocks separated by an empty line, and that the data in the first column of the blocks overlap, but may be unsorted and may not match across blocks.

To work with data like this I would use Pandas. We could read it with pandas.read_csv(file, sep=r'\s+', ...), but here I will use the more specialized pandas.read_fwf:

import pandas as pd
from io import StringIO

file = StringIO(data)
first, second = 'first column', 'second column'
df = pd.read_fwf(file, names=[first, second], skip_blank_lines=False)

Importantly, we keep the blank lines via the skip_blank_lines=False argument, so that we can use them to group the data in the next step:

df['block'] = df[first].isna().cumsum()   # assign a unique number to each block
df = df.dropna()                          # drop empty lines
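To see why this works, here is a minimal sketch (with a hypothetical column, not the real file) of how the NaN rows left over from the blank lines turn into block numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical first column after read_fwf: the NaN marks the old blank line.
col = pd.Series([0.0, 0.1, np.nan, 0.0, 0.1])

# Each NaN increments the running count, so every block gets its own id.
block = col.isna().cumsum()
print(block.tolist())  # [0, 0, 1, 1, 1]
```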

Now the idea is to apply one of the pivoting techniques, using the first column as the index and the block number as the column name. But we cannot do that unless every (index, column) pair is unique, and in your case the values in the first column can be duplicated within a block. So we need to fix that first, for example by numbering the repeated values within each block:

df['inner_id'] = df.groupby(['block', first]).cumcount()
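A quick sketch of what cumcount produces, on a hypothetical miniature frame where the key 0.075 repeats inside block 0:

```python
import pandas as pd

# Hypothetical miniature frame: the key 0.075 appears twice in block 0.
df = pd.DataFrame({
    'block': [0, 0, 0, 1],
    'A':     [0.075, 0.075, 0.100, 0.075],
    'B':     [-44.1, -44.2, -44.3, -20.1],
})

# cumcount numbers the rows of each (block, A) group 0, 1, 2, ...
df['inner_id'] = df.groupby(['block', 'A']).cumcount()
print(df['inner_id'].tolist())  # [0, 1, 0, 0]
```

The duplicated key gets inner_id values 0 and 1, so the (A, inner_id) pairs become unique and the pivot below no longer fails.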

Now we can pivot the data, for example:

result = df.pivot(columns='block', index=[first, 'inner_id'], values=second)

And, if needed, convert the result to a NumPy array:

result = result.reset_index(first).to_numpy()

Code to experiment with:

import pandas as pd
from io import StringIO

# test data, note the difference between blocks in the first column 
data = '''\
    0.0000  -44.2709
    0.0500  -44.2709
    0.0750  -44.2701
    0.0750  -44.2702
    0.1000  -44.2708
    0.1250  -44.2707
    0.1500  -44.2706
    0.0750  -44.2703

    0.0000  -20.8317
    0.0250  -20.8322
    0.0500  -20.8338
    0.0750  -20.8364
    0.0750  -20.8365
    0.1000  -20.8400
    0.1000  -20.8401
    0.1250  -20.8445
'''

file = StringIO(data)
first, second = 'A', 'B'

df = pd.read_fwf(file, names=[first, second], skip_blank_lines=False)
df['block'] = df[first].isna().cumsum()
df = df.dropna()
df['inner_id'] = df.groupby(['block', first]).cumcount()

result = df.pivot(columns='block', index=[first,'inner_id'], values=second)

Output for the test data:

>>> print(result)

block                 0        1
A     inner_id                  
0.000 0        -44.2709 -20.8317
0.025 0             NaN -20.8322
0.050 0        -44.2709 -20.8338
0.075 0        -44.2701 -20.8364
      1        -44.2702 -20.8365
      2        -44.2703      NaN
0.100 0        -44.2708 -20.8400
      1             NaN -20.8401
0.125 0        -44.2707 -20.8445
0.150 0        -44.2706      NaN

As for the data in the original post, the value 3.2802 is repeated in the first column of both blocks. When the code above is run on that data, the result likewise contains 2 records with index 3.2802:

>>> print(result.loc[3.2802])

block           0        1
inner_id                  
0        -44.2693 -20.9179
1        -44.2677 -20.9106
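For completeness, here is the whole pipeline on a tiny made-up sample, ending with the reset_index/to_numpy conversion mentioned earlier; read_csv with sep=r'\s+' is used instead of read_fwf purely for compactness:

```python
import pandas as pd
from io import StringIO

# Tiny made-up sample in the same shape: two blocks separated by a blank line.
data = "0.0  -44.1\n0.1  -44.2\n\n0.0  -20.1\n0.1  -20.2\n"

df = pd.read_csv(StringIO(data), sep=r'\s+', names=['A', 'B'],
                 skip_blank_lines=False)
df['block'] = df['A'].isna().cumsum()   # number the blocks
df = df.dropna()                        # drop the blank-line rows
df['inner_id'] = df.groupby(['block', 'A']).cumcount()
pivoted = df.pivot(columns='block', index=['A', 'inner_id'], values='B')

# Back to a plain NumPy array: k values first, then one column per band.
arr = pivoted.reset_index('A').to_numpy()
print(arr.shape)  # (2, 3)
```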
