Python: how to programmatically find the correct header when reading Excel files with pandas

Posted on April 4

I have a list of Excel files (.xlsx, .xls), and I am trying to get the header of each file after loading it.

Here I have taken one Excel file and loaded it into pandas with:

pd.read_excel("sample.xlsx")

The output is:

Here we want to get the header information we actually need. In the attached picture, the required header is at index 8, as highlighted in red.

pd.read_excel('sample.xlsx',skiprows=9)

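Now that we know the correct header is at index 8, I can go back and pass skiprows to read_excel, as in the call above, so that reading starts from that row and it becomes the header. (The value passed is 9 rather than 8, presumably because the first sheet row had already been consumed as column names when index 8 was counted.)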

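How can this be handled programmatically for a list of Excel files when we do not know where the header is? In this case we already know the header is at 8, but what about other files where we do not know?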

A sample file can be downloaded for reference:

Recommended answer

Use:

import pandas as pd

df = pd.read_excel('sample_file.xlsx')

# True where the previous row is entirely NaN (fill_value=0 keeps the first row from matching)
m1 = df.shift(fill_value=0).isna().all(axis=1)
# True where the current row has no NaN at all
m2 = df.notna().all(axis=1)
# keep the first row that satisfies both masks and every row after it
df = df[(m1 & m2).cummax()]

# promote the first remaining row to column names, then drop it from the data
df = df.set_axis(df.iloc[0].rename(None), axis=1).iloc[1:].reset_index(drop=True)

print(df)
   LN  FN          SSN        DOB        DOH Gender  Comp_2011 Comp_2010  \
0  Ax  Bx  000-00-0000   8/3/1800   1/1/1800   Male  384025.56    396317   
1  Er  Ds  000-00-0000   5/7/1800   7/1/1800   Male  382263.86    392474   
2  Po  Ch  000-00-0000   9/9/1800   1/1/1800   Male  406799.34    395677   
3  Rt  Da  000-00-0000  6/24/1800   7/1/1800   Male  395767.12    424093   
4  Yh  St  000-00-0000  3/15/1800   7/1/1800   Male  376936.58    373754   
5  Ws  Ra  000-00-0000  6/12/1800  7/10/1800   Male  425720.06    420927   

  Comp_2009 Allocation Group                  NRD  
0    360000             0.05  2022-09-01 00:00:00  
1    360000             0.05  2015-06-01 00:00:00  
2    360000             0.05  2013-01-01 00:00:00  
3    360000             0.05  2020-07-01 00:00:00  
4    360000                0  2013-01-01 00:00:00  
5    306960                0  2034-07-01 00:00:00
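
To apply the same idea to a whole list of files, the heuristic can be wrapped in a function. The following is a minimal sketch, not part of the original answer. The names promote_detected_header and load_all are hypothetical. The files are read with header=None so that every sheet row stays in the data, and, as an assumption that differs from the code above, the top of the sheet is treated like a blank row so that a header sitting on the very first row is also found. The small in-memory frame only stands in for a real workbook.

```python
import pandas as pd


def promote_detected_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first fully populated row that follows a blank row as the header.

    `raw` is expected to come from pd.read_excel(..., header=None), so every
    sheet row, including the real header, is still ordinary data.
    """
    blank = raw.isna().all(axis=1)
    # Assumption: the top of the sheet counts as "blank" (fill_value=True).
    prev_blank = blank.shift(fill_value=True).astype(bool)
    full = raw.notna().all(axis=1)

    hits = (prev_blank & full).to_numpy().nonzero()[0]
    if hits.size == 0:
        raise ValueError("no row matched the header heuristic")

    pos = int(hits[0])
    body = raw.iloc[pos + 1:].reset_index(drop=True)
    body.columns = raw.iloc[pos].tolist()
    return body


def load_all(paths):
    """Hypothetical helper: return {path: cleaned frame} for each Excel file."""
    frames = {}
    for path in paths:
        raw = pd.read_excel(path, header=None)
        frames[str(path)] = promote_detected_header(raw)
    return frames


# Tiny in-memory stand-in for a sheet with a title row and a blank row on top.
raw = pd.DataFrame([
    ["Report", None, None],
    [None, None, None],
    ["LN", "FN", "Gender"],
    ["Ax", "Bx", "Male"],
    ["Er", "Ds", "Male"],
])
clean = promote_detected_header(raw)
print(clean)
```

Calling load_all on a list of paths would give one cleaned frame per file. Because the header text and the data share the same columns while being read, the resulting columns are likely to have object dtype; DataFrame.infer_objects() may recover numeric types. Files whose header row contains empty cells would not match this heuristic and would need a looser rule.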