I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:

<text xml:lang="">
    <body>
      <div>
        <p>
          <p>
            <lb xml:id="p1z1" />19.
                    <lb xml:id="p1z2" />esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                    <lb xml:id="p1z3" />esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
                    <lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
                    <lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
                    <lb xml:id="p1z6" />fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
                    <lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                    <lb xml:id="p1z8" />vel generum mihi per literas responsurum. Frater igitur dixit quidem
                    <lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                    <lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                    <lb xml:id="p1z11" />promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
                    <lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
                    <lb xml:id="p1z13" />ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
                    <lb xml:id="p1z14" />nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
                    <lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
                    <lb xml:id="p1z16" />per Mosen. Mea est ultro et ego retribuam eis in tempore.
                    <lb xml:id="p1z17" />De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
                    <lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
        </p>
      </div>
    </body>
  </text>
</TEI>

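The sentences I need to tag span several lines. The lines are marked with the line break tag <lb/>. I need to tag the sentences somehow and then write them back to the file in their original form. The problem I'm having is that when the text contains line breaks, as soon as I create a sentence element and try to attach it to the line break tag, the line breaks end up invalid...

The output should look like this: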

<text xml:lang="">
    <body>
      <div>
        <p>
          <p>
            <lb xml:id="p1z1" /><s n="1" xml:lang="la">19.</s>
                    <lb xml:id="p1z2" /><s n="1" xml:lang="la">esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
                    <lb xml:id="p1z3" />esse epistolam meam interpretatum.</s><s n="2" xml:lang="la"> Caeterum, quod scribis te ex consilio consanguine
                    <lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
                    <lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit.</s><s n="3" xml:lang="la"> Res enim ista ad me suum ad
                    <lb xml:id="p1z6" />fratrem pertinebat.</s><s n="4" xml:lang="la"> Nec ita fueram abs te dimissus, quod vel tu tale
                    <lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
                    <lb xml:id="p1z8" />vel generum mihi per literas responsurum.</s><s n="5" xml:lang="la"> Frater igitur dixit quidem
                    <lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
                    <lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
                    <lb xml:id="p1z11" />promisisses, ita faceres.</s><s n="6" xml:lang="la"> Ego simulatque tergiversationem istam cognoscere
                    <lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
                    <lb xml:id="p1z13" />ut dicitur.</s><s n="7" xml:lang="la"> Nam quae plana sunt et integra sive dicantur sive scripsisse
                    <lb xml:id="p1z14" />nihil refert.</s><s n="8" xml:lang="la"> Utut sit, ego iniuriam illam, ex qua omnes istae
                    <lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
                    <lb xml:id="p1z16" />per Mosen.</s><s n="9" xml:lang="la"> Mea est ultro et ego retribuam eis in tempore.</s>
                    <lb xml:id="p1z17" /><s n="10" xml:lang="la">De altero etiam capite accipio tuam excusationem.</s><s n="11" xml:lang="la"> Quum enim tam sancte
                    <lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
       </p>
      </div>
    </body>
  </text>
</TEI>

My code looks like this:

import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')

def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree

def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]
            
            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = " ".join(sentences).split("\n")
            print(sentences2)
            
            # Clear the paragraph's existing content
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')

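I'm going crazy. I'm hoping there is an XML pro out there willing to come to my rescue.

I also tried tagging the sentences first and reinserting the line break tags afterwards, but the nature of XML makes that hard to do. Next I might try creating temporary .txt files, searching them line by line, and inserting the tags on the lines that don't match...

At this point, any and all help is appreciated.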

Recommended answer

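This can be done by taking advantage of the tail property of the lb elements. For each lb, build a list in which the serialized element is the item at index 0, followed by the pieces of element.tail split with the regexp r'(\.|\n)'. The sentence tags are then placed by detecting the start and the end (a dot) of each sentence.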

['<lb xml:id="p1z1"/>', '19', '.', '', '\n', '            ']

That list represents this element (quoted to show the whitespace):

'<lb xml:id="p1z1"/>19.
                '
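
To see where such a list comes from, here is a minimal sketch using only the standard library. The two-line fragment is made up for illustration, and split_tail is a hypothetical helper name. It shows that the text after each lb lives in lb.tail, and that the capturing group in the regexp keeps the dots and newlines as separate items, so joining the pieces gives back the original text.

```python
import re
import xml.etree.ElementTree as ET

# Tiny made-up fragment: the text after each <lb/> is stored in lb.tail.
fragment = '<p><lb xml:id="p1z1"/>19.\n  <lb xml:id="p1z2"/>esse epistolam. Caeterum\n</p>'
root = ET.fromstring(fragment)


def split_tail(tail):
    """Split on dots and newlines; the capturing group keeps the separators."""
    return re.split(r'(\.|\n)', tail or '')


tails = [lb.tail for lb in root.iter('lb')]
pieces = [split_tail(t) for t in tails]
print(tails)
print(pieces)
```

Because nothing is dropped by the split, whitespace and indentation survive when the pieces are joined again.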

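The script does not take namespaces into account and is offered as a proof of concept of the parsing technique. Marking the sentences with self-closing elements might be cleaner: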

<lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
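
On the namespace caveat: real TEI files usually declare a default namespace, so a lookup by bare tag name finds nothing. The sketch below uses the standard library's ElementTree and assumes the file uses the usual TEI namespace http://www.tei-c.org/ns/1.0 (the sample in the question does not show its root element, so this is an assumption). The tiny document and the names TEI_NS and XML_ID are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the document declares the standard TEI namespace as its default.
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}
# The xml: prefix is always bound to this URI, so xml:id is stored under this key.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p>'
    '<lb xml:id="p1z1"/>19.\n<lb xml:id="p1z2"/>esse epistolam.\n'
    '</p></div></body></text></TEI>'
)
root = ET.fromstring(sample)

# Look up namespaced elements through a prefix map instead of stripping tags.
lbs = root.findall('.//tei:p/tei:lb', TEI_NS)
lb_ids = [lb.get(XML_ID) for lb in lbs]
print(lb_ids)

# The literal key 'xml:id' is not found: the parser has already expanded the prefix.
print(lbs[0].get('xml:id'))
```

lxml accepts the same kind of prefix map through the namespaces argument of xpath(), for example doc.xpath('//tei:div/tei:p', namespaces=TEI_NS).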

Given this sample:

<text xml:lang="">
  <body>
    <div>
      <p>
        <p>
            <lb xml:id="p1z1"/>19.
            <lb xml:id="p1z2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
            <lb xml:id="p1z3"/>esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
            <lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
            <lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
            <lb xml:id="p1z6"/>fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
            <lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
            <lb xml:id="p1z8"/>vel generum mihi per literas responsurum. Frater igitur dixit quidem
            <lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
            <lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
            <lb xml:id="p1z11"/>promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
            <lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
            <lb xml:id="p1z13"/>ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
            <lb xml:id="p1z14"/>nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
            <lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
            <lb xml:id="p1z16"/>per Mosen. Mea est ultro et ego retribuam eis in tempore.
            <lb xml:id="p1z17"/>De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
            <lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
        </p>
      </p>
    </div>
  </body>
</text>

Result:

<text xml:lang="">
  <body>
    <div>
      <p>
        <p>
            <lb xml:id="p1z1"/><s n="1"/>19.
            <lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
            <lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
            <lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
            <lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit.<s n="4"/> Res enim ista ad me suum ad
            <lb xml:id="p1z6"/>fratrem pertinebat.<s n="5"/> Nec ita fueram abs te dimissus, quod vel tu tale
            <lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
            <lb xml:id="p1z8"/>vel generum mihi per literas responsurum.<s n="6"/> Frater igitur dixit quidem
            <lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
            <lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
            <lb xml:id="p1z11"/>promisisses, ita faceres.<s n="7"/> Ego simulatque tergiversationem istam cognoscere
            <lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
            <lb xml:id="p1z13"/>ut dicitur.<s n="8"/> Nam quae plana sunt et integra sive dicantur sive scripsisse
            <lb xml:id="p1z14"/>nihil refert.<s n="9"/> Utut sit, ego iniuriam illam, ex qua omnes istae
            <lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
            <lb xml:id="p1z16"/>per Mosen.<s n="10"/> Mea est ultro et ego retribuam eis in tempore.
            <lb xml:id="p1z17"/><s n="11"/>De altero etiam capite accipio tuam excusationem.<s n="12"/> Quum enim tam sancte
            <lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
        </p>
      </p>
    </div>
  </body>
</text>

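Set self_close = False to get the OP's tags (paired <s>...</s> elements). The rebuilt markup is parsed back into an element, which then replaces the original one in the document: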

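import re
from xml.sax.saxutils import escape

from lxml import etree

doc = etree.parse('/home/luis/tmp/tmp.xml')
# find the parent element that holds the lb elements
parent = doc.xpath('//div/p/p')[0]

# start the rebuilt fragment; keep the indentation before the first lb
xml_text = '<p>' + escape(parent.text or '')
i = 1              # running sentence number
is_open = False    # True while inside a sentence
self_close = True  # True: <s n="..."/> markers; False: paired <s>...</s>

for t in parent.xpath('lb'):
    # split the tail on dots and newlines, keeping the separators;
    # escape first so that &, < and > survive re-parsing
    parts = ['']
    parts.extend(re.split(r'(\.|\n)', escape(t.tail or '')))

    # index 0 holds the serialized lb itself, without its tail
    t.tail = None
    parts[0] = etree.tostring(t).decode('utf-8')

    for p, e in enumerate(parts):
        # empty or whitespace-only pieces never start a sentence
        skip = (e == '' or re.match(r'^\s+$', e) is not None)

        if p > 0 and not is_open and not skip:
            # first real text after a sentence end: open a new sentence
            if self_close:
                parts[p] = f'<s n="{i}"/>{e}'
            else:
                parts[p] = f'<s n="{i}">{e}'
            is_open = True
        elif is_open and e == '.':
            # a dot ends the current sentence
            if not self_close:
                parts[p] = '.</s>'
            is_open = False
            i += 1

    # append this line once all of its parts have been processed
    xml_text += ''.join(parts)

# close the last sentence if it did not end with a dot
if not self_close and is_open:
    xml_text += '</s>'

xml_text += '</p>'
# parse back to an element
xfrag = etree.fromstring(xml_text)
xfrag.tail = parent.tail

# replace the parent element in the document
parent.getparent().replace(parent, xfrag)
print(etree.tostring(doc).decode('utf-8'))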
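
To try the technique without touching a file, the same logic could be wrapped in a function. This is only a sketch: it swaps lxml for the standard library's xml.etree.ElementTree, the name tag_sentences and the three-line sample are made up for illustration, and it ignores namespaces just like the script above.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def tag_sentences(p_xml, self_close=True):
    """Return p_xml with <s> markers added around dot-terminated sentences."""
    parent = ET.fromstring(p_xml)
    out = ['<p>', escape(parent.text or '')]
    n, is_open = 1, False
    for lb in parent.findall('lb'):
        pieces = re.split(r'(\.|\n)', escape(lb.tail or ''))
        lb.tail = None  # ET.tostring would otherwise serialize the tail too
        out.append(ET.tostring(lb, encoding='unicode'))
        for piece in pieces:
            if not is_open and piece.strip():
                # first non-blank text after a sentence end opens a sentence
                out.append(f'<s n="{n}"/>' if self_close else f'<s n="{n}">')
                out.append(piece)
                is_open = True
            elif is_open and piece == '.':
                # a dot ends the current sentence
                out.append('.' if self_close else '.</s>')
                is_open = False
                n += 1
            else:
                out.append(piece)
    if is_open and not self_close:
        out.append('</s>')  # last sentence had no closing dot
    out.append('</p>')
    return ''.join(out)


# Made-up sample: the second sentence spans two lines, the last one has no dot.
sample = '<p>\n<lb xml:id="a"/>Una.\n<lb xml:id="b"/>Duo\n<lb xml:id="c"/>tres. Quattuor\n</p>'
tagged = tag_sentences(sample, self_close=False)
print(tagged)
```

With self_close=False, a sentence that spans a line break simply contains the lb element as a child of its s element, which is why the result stays well-formed.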
