Python: parsing multi-line table cells from space-aligned tabular data

Posted on March 19

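I have a somewhat messy file that simply dumps everything into an HTML <pre> tag and splits the column headings across two lines. I'm new to Python and regex, and I'm struggling to find a way to merge these two lines correctly, so that the column headings end up on a single line that I can match against. The end goal is to parse the whole file into fields.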

Here is an example of how it looks on the web:

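What I want is to match up the fields on a single line. For example, if I just strip the extra whitespace, "CLOCK" ends up matched with FINISHER rather than with TIME. What I want is:

ID#|PLACE|CLASS PLACE|FINISHER|CLOCK TIME|NET TIME|PACE

Here is the actual HTML: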

</B>             CLASS                                            CLOCK       NET    
  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE
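
A quick sketch of why plain whitespace splitting is not enough here. It only uses the built-in str.split() on the two header lines above (with the stray </B> tag dropped); the variable names are chosen for illustration. Because split() throws away all position information, pairing the fragments by order attaches them to the wrong columns:

```python
# Header lines copied from the <pre> block above (the stray </B> tag is dropped).
line1 = "             CLASS                                            CLOCK       NET    "
line2 = "  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE"

# str.split() with no arguments discards every bit of column position.
top_words = line1.split()
bottom_words = line2.split()

# Pairing by order alone matches the fragments with the wrong columns.
naive_pairs = list(zip(top_words, bottom_words))
print(naive_pairs)
# [('CLASS', 'ID#'), ('CLOCK', 'PLACE'), ('NET', 'PLACE')]
```

So the character positions of the fragments have to be taken into account when merging.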

Recommended answer

This code does the job. We separate the headings in the two lines using the following assumptions: text in the two lines whose indices overlap or are immediately adjacent belongs to the same heading; when both lines have a space at a particular position, we can assume that the material on each side belongs to separate headings. No regular expressions are needed.

# read in the 2 lines:
line1 = '             CLASS                                            CLOCK       NET    '
line2 = '  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE  '

# pad the shorter among the lines, so that both are equally long:
linediff = len(line1) - len(line2)
if linediff > 0:
    line2 += ' ' * linediff
else:
    line1 += ' ' * (-linediff)
length = len(line1)

# go through both lines character-by-character:
top, bottom = [], []
i = 0
while i < length:
    # skip indices where both lines have a space:
    if line1[i] == ' ' and line2[i] == ' ':
        i += 1
    else:
        # find the first j to the right of i for which
        # both lines have a space:
        j = i
        while (j < length) and (line1[j] != ' ' or line2[j] != ' '):
            j += 1
        # copy the lines from position i (inclusive)
        # to j (exclusive) into top and bottom:
        top.append(line1[i:j])
        bottom.append(line2[i:j])
        # we are done with one heading and advance i:
        i = j

# top:
# ['   ', '     ', 'CLASS', '        ', ' CLOCK', '  NET', '    ']
# bottom:
# ['ID#', 'PLACE', 'PLACE', 'FINISHER', 'TIME  ', 'TIME ', 'PACE']

headers = []
for str1, str2 in zip(top, bottom):
    # remove leading/trailing spaces from each partial heading:
    s1, s2 = str1.strip(), str2.strip()
    # merge partial headings
    # (strip is needed because one of the two might be empty):
    headers.append((s1 + ' ' + s2).strip())

# headers:
# ['ID#', 'PLACE', 'CLASS PLACE', 'FINISHER', 'CLOCK TIME', 'NET TIME', 'PACE']

Note that the problem actually has nothing to do with HTML, so no special HTML handling is needed.
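
The same idea can be wrapped in a reusable function. The following is a minimal sketch, not part of the original answer: the names merge_header_lines and split_row are hypothetical, the function accepts any number of stacked header lines, and it also returns the character span of each column. The short header lines and the data row are made up for illustration. Slicing a data row by those spans assumes that every value sits inside the span of its header, which may not hold for a real file (a long name, for instance, could run past the end of its heading).

```python
def merge_header_lines(lines):
    """Merge stacked header fragments into one name per column.

    Returns a list of (start, end, name) tuples; end is exclusive.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]

    columns = []
    i = 0
    while i < width:
        # A position separates columns only if every line has a space there.
        if all(line[i] == " " for line in padded):
            i += 1
            continue
        # Extend j to the next position where all lines have a space.
        j = i
        while j < width and any(line[j] != " " for line in padded):
            j += 1
        parts = [line[i:j].strip() for line in padded]
        name = " ".join(part for part in parts if part)
        columns.append((i, j, name))
        i = j
    return columns


def split_row(row, columns):
    """Slice one data row by the header spans (assumes aligned values)."""
    return {name: row[start:end].strip() for start, end, name in columns}


# Small made-up header, built explicitly so the offsets are easy to check.
top = " " * 6 + "CLASS"
bottom = "ID#" + " " * 3 + "PLACE" + " " * 2 + "PACE"
columns = merge_header_lines([top, bottom])
print(columns)
# [(0, 3, 'ID#'), (6, 11, 'CLASS PLACE'), (13, 17, 'PACE')]

# Hypothetical data row aligned under the header above.
row = "101" + " " * 6 + "12" + " " * 2 + "7:15"
fields = split_row(row, columns)
print(fields)
# {'ID#': '101', 'CLASS PLACE': '12', 'PACE': '7:15'}
```

If the values in a real file do overflow their headings, one option is to apply the "every line has a space here" test to the header and data lines together, rather than to the header lines alone.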
