I have this DataFrame:

import pandas as pd

url = "https://www-genesis.destatis.de/genesisWS/rest/2020/data/tablefile?username=DEB924AL95&password=P@ssword123&name=45213-0005&area=all&compress=false&transpose=false&startyear=1900&endyear=&timeslices=&regionalvariable=&regionalkey=&classifyingvariable1=WERTE4&classifyingkey1=REAL&classifyingvariable2=WERT03&classifyingkey2=BV4KSB&classifyingvariable3=WZ2008&classifyingkey3=WZ08-551&format=xlsx&job=false&stand=01.01.1970&language=en"


df = pd.read_excel(url)

#df.head(20)

df = df.iloc[7:-5]
df
                                        
Turnover from accommodation and food services: Germany,\nmonths, price types, original and adjusted data, economic\nactivities  Unnamed: 1  Unnamed: 2

WZ08-55 Accommodation       NaN     NaN  
                1994    January     121.9
                 NaN   February     122.0
                          March     121.4
                          April     122.1
                            May     120.1
                           June     123.1
                           July     125.5
                         August     126.1
                      September     127.8
                        October     124.3
                       November     121.8
                       December     121.7
               1995     January     120.9
                NaN    February     121.5
                          March     120.8

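The expected result should be as follows. WZ08-55 Accommodation is the name of an industry, and this column contains many such industry names. The industry names should become the column headers, and the year that starts in the row right after each industry name should be combined with the month into a date column.

Each industry name is followed by a year in the next row, and the remaining rows in that column are empty. I don't know how to do this.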


Recommended answer

Clean the data first, then reshape the DataFrame:

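import numpy as np

# Rename the three columns to something easier to work with
df.columns = ['Variable', 'Date', 'Value']

# Boolean mask: True on industry-name rows, where the month column is empty
m = df['Date'].isna()

# Build "Month-Year" labels by forward-filling the year down its month rows
df['Date'] += '-' + df['Variable'].ffill()
# Keep only the industry names in 'Variable' and forward-fill them
df['Variable'] = df['Variable'].where(m).ffill()

# Reshape: drop the industry-name rows, treat '...' as missing, pivot to wide
out = (df[~m].replace('...', np.nan)
             .pivot_table(index='Date', columns='Variable',
                          values='Value', sort=False)
             .reset_index().rename_axis(columns=None))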

Output:

>>> out
              Date  WZ08-55 Accommodation  ...  WZ08-561-01 Restaurants  WZ08-55-01 Accommodation and food and beverage service act.
0     January-1994                  121.9  ...                    193.2                                              152.5          
1    February-1994                  122.0  ...                    189.0                                              150.7          
2       March-1994                  121.4  ...                    192.2                                              152.2          
3       April-1994                  122.1  ...                    189.0                                              150.6          
4         May-1994                  120.1  ...                    189.9                                              150.1          
..             ...                    ...  ...                      ...                                                ...          
351     April-2023                   98.6  ...                     90.0                                               91.8          
352       May-2023                   90.9  ...                     85.4                                               87.3          
353      June-2023                   88.2  ...                     84.7                                               87.3          
354      July-2023                   84.4  ...                     84.3                                               84.7          
355    August-2023                   82.3  ...                     83.2                                               83.2          

[356 rows x 12 columns]
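
To make the mask-and-forward-fill idea easier to try without downloading the spreadsheet, here is a minimal sketch on made-up data. The toy frame, the industry names "Industry A" and "Industry B", and the helper name `reshape_industries` are all hypothetical. The sketch also assumes the years may arrive as integers (hence `astype(str)`), uses `pd.to_numeric(..., errors="coerce")` instead of `replace('...', np.nan)` to turn the `...` placeholders into NaN, and restores the original row and column order with `reindex` rather than relying on `sort=False`.

```python
import numpy as np
import pandas as pd


def reshape_industries(df):
    """Turn the stacked industry/year/month layout into one column per industry."""
    df = df.copy()
    df.columns = ["Variable", "Date", "Value"]

    # Industry-name rows are the ones with no month in the 'Date' column.
    is_header = df["Date"].isna()

    # Forward-fill the year down its month rows and build "Month-Year" labels.
    year = df["Variable"].ffill().astype(str)
    df["Date"] = df["Date"] + "-" + year

    # Keep only the industry names, then forward-fill them over their block.
    df["Variable"] = df["Variable"].where(is_header).ffill()

    # Placeholders such as '...' become NaN.
    df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

    tidy = df[~is_header]
    wide = tidy.pivot_table(index="Date", columns="Variable", values="Value")

    # Restore order of first appearance for both rows and columns.
    wide = wide.reindex(index=tidy["Date"].unique(),
                        columns=tidy["Variable"].unique())
    return wide.reset_index().rename_axis(columns=None)


# Hypothetical toy data mimicking the layout shown in the question.
raw = pd.DataFrame({
    "Variable": ["Industry A", 1994, np.nan, 1995,
                 "Industry B", 1994, np.nan, 1995],
    "Date": [np.nan, "January", "February", "January",
             np.nan, "January", "February", "January"],
    "Value": [np.nan, 121.9, 122.0, 120.9,
              np.nan, 193.2, "...", 190.0],
})

out = reshape_industries(raw)
print(out)
```

Each industry ends up as its own column, with one row per "Month-Year" label; the `...` cell shows up as NaN.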

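Edit: to get the expected date format, you can replace:

df['Date'] = (pd.to_datetime(df['Date'] + '-' + df['Variable'].ffill(), format='%B-%Y')
                .dt.strftime('%b-%y'))

with:

df['Date'] = pd.to_datetime(df['Date'], format='%B-%Y').dt.strftime('%b-%y')

Note that the second form parses df['Date'] directly, so it assumes the column already holds the combined 'Month-Year' string (that is, the concatenation step in the main code has already run).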
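
As a quick check of what the `%B-%Y` to `%b-%y` conversion does, here is a small sketch on a hypothetical Series of labels. It assumes an English locale, since `%B` and `%b` month names are locale-dependent.

```python
import pandas as pd

# Hypothetical "Month-Year" labels of the kind produced by the cleaning step.
dates = pd.Series(["January-1994", "February-1994", "December-2023"])

# Parse full month name + 4-digit year, then format as short month + 2-digit year.
short = pd.to_datetime(dates, format="%B-%Y").dt.strftime("%b-%y")
print(short.tolist())
```

Under an English locale this prints `['Jan-94', 'Feb-94', 'Dec-23']`.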
