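Here is my code. My task is to detect a bunch of text files in a folder and then parse strings from them into data that is written out as a CSV file. Can you give me some hints on how to do this? I'm struggling.

The first step of my code is to detect where the data sits in each txt file. I found that all the data starts with 'Read', so I locate the line where the data begins in each file. After that, I got stuck on how to export the data to a CSV file.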

import os
import argparse
import csv
from typing import List


def validate_directory(path):
    if os.path.isdir(path):
        return path
    else:
        raise NotADirectoryError(path)


def get_data_from_file(file) -> List[str]:
    ignore_list = ["Read Segment", "Read Disk", "Read a line", "Read in"]
    data = []
    with open(file, "r", encoding="latin1") as f:
        try:
            lines = f.readlines()
        except Exception as e:
            print(f"Unable to process {file}: {e}")
            return []
        for line_number, line in enumerate(lines, start=1):
            if not any(variation in line for variation in ignore_list):
                if line.strip().startswith("Read ") and not line.strip().startswith("Read ("): # TODO: fix this with better regex
                    data.append(f'Found "Read" at line {line_number} in {file}')
                    print(f'Found "Read" at {file}:{line_number}')
                    print(lines[line_number-1])
    return data


def list_read_data(directory_path: str) -> List[str]:
    total_data = []
    for root, _, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".txt"):
                data = get_data_from_file(os.path.join(root, file_name))
                total_data.extend(data)

    return total_data


def write_results_to_csv(output_file: str, data: List[str]):
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Results"])
        for line in data:
            writer.writerow([line])


def main(directory_path: str, output_file: str):
    data = list_read_data(directory_path)
    write_results_to_csv(output_file, data)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process the 2020Model folder for input data."
    )
    parser.add_argument(
        "--directory", type=validate_directory, help="folder to be processed"
    )
    parser.add_argument("--output", type=str, help="Output file name (e.g., outputfile.csv)", default="outputfile.csv")

    args = parser.parse_args()
    main(os.path.abspath(args.directory), args.output)

Here is my ideal CSV output:

1985 1986 1986 1987 1988 1989 1990 1991 1992 1993 1994
37839 36962 37856 41971 40838 44640.87 42826.34 44883.03 43077.59 45006.49 46789

Can you give me some hints?

  • Where should the string parsing go?
  • How do I write the output to a CSV file?

Here is a sample txt file:

Select Year(2007-2025)
Read TotPkSav
/2007     2008     2009     2010     2011     2012     2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023     2024     2025 
   00       27       53       78      108      133      151      161      169      177      186      195      205      216      229      242      257      273      288 

Recommended answer

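If all of your files look like these 4 lines, I recommend simply reading each file into a list of lines up front, rather than trying to iterate through it line by line. I'd also recommend just using glob with recursive=True and avoiding walking the tree yourself.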
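As a minimal sketch of the glob suggestion, adapted to the `--directory` argument from the question: the helper name `find_txt_files` is hypothetical and not part of either the question's or the answer's code.

```python
import glob
import os


def find_txt_files(directory: str) -> list[str]:
    """Return every .txt file under `directory`, including subfolders.

    Hypothetical helper: it plays the role of the os.walk loop in the question.
    """
    # "**" only matches nested folders (zero or more) when recursive=True.
    pattern = os.path.join(directory, "**", "*.txt")
    # Sorting is optional; it just makes the processing order predictable.
    return sorted(glob.glob(pattern, recursive=True))
```

Calling `find_txt_files(args.directory)` would then give a flat list of paths to loop over, with no nested `for root, _, files in os.walk(...)` needed.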

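Because the code reads the files inside a for loop, you can skip any file that doesn't have the right properties by simply using continue to move on to the next file in the loop: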

import csv
import glob

all_rows: list[list[str]] = []

# "**/*.txt" with recursive=True matches .txt files here and in every subfolder
for fname in glob.glob("**/*.txt", recursive=True):
    with open(fname, encoding="iso-8859-1") as f:
        print(f"reading {fname}")
        lines = [x.strip() for x in list(f)]

        # only files with exactly 4 lines are processed
        if len(lines) != 4:
            print(f'skipping {fname} with too few lines"')
            continue

        # line 2 must start with "Read", but not with "Read ("
        line2 = lines[1]
        if line2[:4] != "Read" or line2[:6] == "Read (":
            print(f'skipping {fname} with line2 = "{line2}"')
            continue

        line3, line4 = lines[2:4]

        # drop the leading "/" in front of the first year
        if line3.startswith("/"):
            line3 = line3[1:]

        # split on spaces and discard the empty strings left by repeated spaces
        header = [x for x in line3.split(" ") if x]
        data = [x for x in line4.split(" ") if x]

        all_rows.append(header)
        all_rows.append(data)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Result"])
    writer.writerows(all_rows)
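On the question of where the string parsing should go: one option is to pull the checks and the splitting out of the loop into a small function of their own. The following is only a sketch under the same 4-line assumption as above; the name `parse_read_block` and its return shape are hypothetical.

```python
from typing import Optional


def parse_read_block(lines: list[str]) -> Optional[tuple[list[str], list[str]]]:
    """Return (header, values) for a 4-line "Read" file, or None to skip it."""
    lines = [line.strip() for line in lines]
    # Same rule as the loop above: anything that is not exactly 4 lines is skipped.
    if len(lines) != 4:
        return None
    # Line 2 must start with "Read", but "Read (" marks a file to ignore.
    if not lines[1].startswith("Read") or lines[1].startswith("Read ("):
        return None
    # str.split() with no argument splits on runs of whitespace and never
    # yields empty strings, so no extra filtering is needed.
    header = lines[2].lstrip("/").split()
    values = lines[3].split()
    return header, values
```

The loop body would then reduce to calling `parse_read_block(list(f))`, using continue when it returns None, and appending the two lists to `all_rows` otherwise.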

I mocked up a few more files and scattered them around my tree:

 - .
 - a
    input3.txt
 - b
    foo.txt
   input1.txt
   input2.txt
   main.py

When I run the program from the root of that tree, I get:

reading input1.txt
reading input2.txt
skipping input2.txt with line2 = "Read (TotPkSav)"
reading a/input3.txt
reading b/foo.txt
skipping b/foo.txt with too few lines"

output.csv looks like this:

| Result |
|--------|
| 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00   | 27   | 53   | 78   | 108  | 133  | 151  | 161  | 169  | 177  | 186  | 195  | 205  | 216  | 229  | 242  | 257  | 273  | 288  |
| 2099 | 2098 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00   | 27   | 53   | 78   | 108  | 133  | 151  | 161  | 169  | 177  | 186  | 195  | 205  | 216  | 229  | 242  | 257  | 273  | 288  |
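To see the raw text that csv.writer produces for rows like these, here is a small sketch that writes to an in-memory buffer instead of a file. The function name `rows_to_csv_text` is hypothetical, and the sample rows are shortened from the example file.

```python
import csv
import io


def rows_to_csv_text(rows: list[list[str]]) -> str:
    """Write rows with csv.writer into a string buffer and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # writerows writes one CSV line per inner list.
    writer.writerows(rows)
    return buffer.getvalue()


text = rows_to_csv_text([["2007", "2008", "2009"], ["00", "27", "53"]])
# Values stay strings, so "00" keeps its leading zero;
# the csv module ends each line with "\r\n" by default.
print(repr(text))
```

Since the ideal output in the question has no extra title line, the `writer.writerow(["Result"])` call in the answer can presumably be dropped if only the year and value rows are wanted.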
