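I have a string in which each section consists of a header followed by bullet points. How can I convert it into a DataFrame with 2 columns, header and point? I would also like to clean the text in the point column to remove the bullet character `-`.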

mystr=' Here is a summary of the key financial trends for XYZ based on the earnings call transcript:\n\nHeader 0:\n- Q4 revenue was $2.7 billion, down 12% sequentially and 16% year-over-year due to broad-based weakness\n- Gross margin was 70.2%, down sequentially and year-over-year\n- Operating margin was 44.7%. FY2023 operating margin was 48.9%, down 50 basis points  \n- Q4 EPS was $2.01, slightly above outlook\n\nHeader 1:  \n- Q4 inventory decreased by $70 million sequentially to 188 days\n- Reduced Q4 OpEx by $60M sequentially through discretionary cuts and lower variable comp\n\nHeader 2:\n- Industrial revenue down 19% sequentially and 20% year-over-year in Q4 on broad-based weakness\n- Automotive revenue down slightly sequentially, up 14% year-over-year in Q4\n- Communications revenue down 6% sequentially, 32% year-over-year in Q4  \n- Consumer revenue down 6% sequentially, 28% year-over-year in Q4\n\nHeader 3:\n- Q1 revenue guidance $2.5 billion +/- $100 million. Expect all end markets down sequentially\n- Expect inventory correction to taper through 1H of FY2024'

Expected output: (shown as an image in the original post)

Recommended answer

Here is an example of how to parse the text using the `re` module:

import re

import pandas as pd

mystr = " Here is a summary of the key financial trends for XYZ based on the earnings call transcript:\n\nHeader 0:\n- Q4 revenue was $2.7 billion, down 12% sequentially and 16% year-over-year due to broad-based weakness\n- Gross margin was 70.2%, down sequentially and year-over-year\n- Operating margin was 44.7%. FY2023 operating margin was 48.9%, down 50 basis points  \n- Q4 EPS was $2.01, slightly above outlook\n\nHeader 1:  \n- Q4 inventory decreased by $70 million sequentially to 188 days\n- Reduced Q4 OpEx by $60M sequentially through discretionary cuts and lower variable comp\n\nHeader 2:\n- Industrial revenue down 19% sequentially and 20% year-over-year in Q4 on broad-based weakness\n- Automotive revenue down slightly sequentially, up 14% year-over-year in Q4\n- Communications revenue down 6% sequentially, 32% year-over-year in Q4  \n- Consumer revenue down 6% sequentially, 28% year-over-year in Q4\n\nHeader 3:\n- Q1 revenue guidance $2.5 billion +/- $100 million. Expect all end markets down sequentially\n- Expect inventory correction to taper through 1H of FY2024"

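all_data = []
# Each outer match is one section: a header that begins at a line start with a
# character other than "-" and runs to the first ":", followed by everything
# up to the next such header or the end of the string.
for header, group in re.findall(
    r"^([^-].*?):(.*?)(?=^[^-].*?:|\Z)", mystr, flags=re.S | re.M
):
    header = header.strip()
    # Capture each bullet's text without the leading "-" and surrounding whitespace.
    for line in re.findall(r"^\s*-\s*(.+?)\s*$", group, flags=re.M):
        all_data.append((header, line))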

df = pd.DataFrame(all_data, columns=["header", "point"])
print(df)

Prints:

      header                                                                                                  point
0   Header 0  Q4 revenue was $2.7 billion, down 12% sequentially and 16% year-over-year due to broad-based weakness
1   Header 0                                           Gross margin was 70.2%, down sequentially and year-over-year
2   Header 0                    Operating margin was 44.7%. FY2023 operating margin was 48.9%, down 50 basis points
3   Header 0                                                               Q4 EPS was $2.01, slightly above outlook
4   Header 1                                         Q4 inventory decreased by $70 million sequentially to 188 days
5   Header 1                Reduced Q4 OpEx by $60M sequentially through discretionary cuts and lower variable comp
6   Header 2          Industrial revenue down 19% sequentially and 20% year-over-year in Q4 on broad-based weakness
7   Header 2                             Automotive revenue down slightly sequentially, up 14% year-over-year in Q4
8   Header 2                                  Communications revenue down 6% sequentially, 32% year-over-year in Q4
9   Header 2                                        Consumer revenue down 6% sequentially, 28% year-over-year in Q4
10  Header 3            Q1 revenue guidance $2.5 billion +/- $100 million. Expect all end markets down sequentially
11  Header 3                                              Expect inventory correction to taper through 1H of FY2024
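
If the regular expressions feel hard to maintain, the same result can probably be reached by walking the string line by line. The following is only a minimal sketch, not part of the original answer: the helper name `parse_sections` and the short `sample` string are made up for illustration, and it assumes every header sits on its own line ending in ":" and every bullet line starts with "-".

```python
def parse_sections(text):
    """Return (header, point) tuples from 'Header:' lines followed by '-' bullets."""
    rows = []
    header = None  # most recent header seen so far
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between sections
        if line.startswith("-"):
            # Bullet line: drop the leading "-" and surrounding whitespace.
            if header is not None:
                rows.append((header, line.lstrip("-").strip()))
        elif line.endswith(":"):
            # Header line: keep the text before the trailing ":".
            header = line[:-1].strip()
    return rows


# Hypothetical shortened input with the same shape as mystr.
sample = (
    "Intro line:\n\nHeader 0:\n- first point  \n- second point\n\n"
    "Header 1:  \n- third point"
)
rows = parse_sections(sample)
print(rows)
```

The returned list has the same shape as `all_data`, so it can be passed to `pd.DataFrame(rows, columns=["header", "point"])` in the same way. The introductory line also ends in ":" and is briefly treated as a header, but since no bullets follow it, it contributes no rows.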
