I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1" />19.
<lb xml:id="p1z2" />esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3" />esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
<lb xml:id="p1z6" />fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8" />vel generum mihi per literas responsurum. Frater igitur dixit quidem
<lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11" />promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13" />ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14" />nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16" />per Mosen. Mea est ultro et ego retribuam eis in tempore.
<lb xml:id="p1z17" />De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
<lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
</p>
</div>
</body>
</text>
</TEI>
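The sentences I need to tag span several lines. The lines are marked with the line break tag "<lb />". I need to tag the sentences somehow and then write them back to the file in their original form. The problem I run into is that whenever the text contains line breaks, as soon as I create a sentence element and try to attach it to a line break tag, the line breaks are lost...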
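To make the problem concrete, here is a minimal sketch of how `xml.etree.ElementTree` stores this kind of mixed content. The two-line snippet is made up for illustration (shortened from the sample above), and `XML_ID` is just a local helper name. The text that follows an empty `<lb />` is kept in that element's `.tail`, so a sentence that runs across two lines is spread over the tails of two different elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-line snippet, shortened from the sample above.
snippet = (
    '<p><lb xml:id="p1z3" />interpretatum. Caeterum, quod scribis'
    '<lb xml:id="p1z4" />et affinium generi tui</p>'
)
p = ET.fromstring(snippet)

# ElementTree exposes xml:id under the predefined XML namespace,
# so the plain key 'xml:id' is not found in lb.attrib.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

for lb in p.iter("lb"):
    # The text after an empty <lb/> lives in its .tail, not in .text.
    print(lb.get(XML_ID), repr(lb.tail))
# p1z3 'interpretatum. Caeterum, quod scribis'
# p1z4 'et affinium generi tui'
```

Here the sentence starting at "Caeterum" begins in the tail of the first `<lb />` and continues in the tail of the second, which is why pairing one `<lb />` with one sentence cannot reproduce the original layout.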
The output should look like this:
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1" /><s n="1" xml:lang="la">19.</s>
<lb xml:id="p1z2" /><s n="1" xml:lang="la">esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3" />esse epistolam meam interpretatum.</s><s n="2" xml:lang="la"> Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit.</s><s n="3" xml:lang="la"> Res enim ista ad me suum ad
<lb xml:id="p1z6" />fratrem pertinebat.</s><s n="4" xml:lang="la"> Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8" />vel generum mihi per literas responsurum.</s><s n="5" xml:lang="la"> Frater igitur dixit quidem
<lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11" />promisisses, ita faceres.</s><s n="6" xml:lang="la"> Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13" />ut dicitur.</s><s n="7" xml:lang="la"> Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14" />nihil refert.</s><s n="8" xml:lang="la"> Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16" />per Mosen.</s><s n="9" xml:lang="la"> Mea est ultro et ego retribuam eis in tempore.</s>
<lb xml:id="p1z17" /><s n="10" xml:lang="la">De altero etiam capite accipio tuam excusationem.</s><s n="11" xml:lang="la"> Quum enim tam sancte
<lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
</p>
</div>
</body>
</text>
</TEI>
My code looks like this:
import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')


def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree


def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())

    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]

            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = " ".join(sentences).split("\n")
            print(sentences2)

            # Clear the paragraph's existing content
            paragraph.clear()

            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence

    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')
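I'm going insane. I'm hoping there is an XML pro out there who is willing to come to my rescue.
I also tried tagging first and then reinserting the line break tags afterwards, but that is hard to do because of the nature of XML. Next I might try creating temporary .txt files, going through them line by line, and inserting the tags on the lines that don't match...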
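For reference, the "tag first, then reinsert the line breaks" idea could be sketched roughly as follows. This is only an illustration under several assumptions, not a tested solution for the real files: every direct child of `<p>` is taken to be an empty `<lb />`, the tags are un-namespaced (in a namespaced TEI file the new `<s>` element would probably need a namespace-qualified tag), and `split_sentences` is a naive regex stand-in for a real tokenizer such as NLTK's. The names `MARK`, `split_sentences`, `append_text`, and `tag_paragraph` are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET

MARK = "\uE000"  # private-use character standing in for each <lb/>
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_sentences(text):
    # Naive stand-in for a real tokenizer: a chunk ends at . ! or ? followed
    # by whitespace or MARK, or at the end of the text. Joining the chunks
    # gives the input back unchanged.
    return re.findall(rf".+?(?:[.!?](?=\s|{MARK})|\Z)", text, flags=re.S)


def append_text(parent, text):
    # Add text at the current end of parent: after its last child if it has
    # one, otherwise as its own leading text.
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def tag_paragraph(p, lang="la"):
    # 1. Flatten the paragraph: all text, with one MARK per <lb/> child.
    lbs = list(p)
    flat = (p.text or "") + "".join(MARK + (lb.tail or "") for lb in lbs)

    # 2. Empty the paragraph but keep the original <lb/> elements for reuse.
    for lb in lbs:
        p.remove(lb)
        lb.tail = None
    p.text = None
    pending = iter(lbs)

    def emit(parent, chunk):
        # Write chunk at the end of parent, turning each MARK back into
        # the next original <lb/> element.
        pieces = chunk.split(MARK)
        append_text(parent, pieces[0])
        for piece in pieces[1:]:
            parent.append(next(pending))
            append_text(parent, piece)

    # 3. Rebuild: one <s> per sentence, with <lb/> elements nested inside
    #    when a sentence runs over a line break.
    n = 0
    for chunk in split_sentences(flat):
        # Leading "whitespace + MARK" runs stay outside <s>.
        lead = re.match(rf"(?:\s*{MARK})*", chunk).end()
        emit(p, chunk[:lead])
        body = chunk[lead:]
        if not body.strip():
            append_text(p, body)  # whitespace only, nothing to wrap
            continue
        n += 1
        s = ET.SubElement(p, "s", {"n": str(n), XML_LANG: lang})
        emit(s, body)
    return p


# Made-up snippet, shortened from the sample above.
snippet = (
    '<p>\n<lb xml:id="p1z1" />19.\n'
    '<lb xml:id="p1z2" />esse epistolam meam,\n'
    '<lb xml:id="p1z3" />interpretatum. Caeterum, quod\n'
    '<lb xml:id="p1z4" />scribis.\n</p>'
)
p = tag_paragraph(ET.fromstring(snippet))
print(ET.tostring(p, encoding="unicode"))
```

Because the original `<lb />` elements are reused rather than recreated, their `xml:id` attributes survive without any extra bookkeeping. The sentence boundaries are only as good as the splitter, so a token like "19." is still treated as a sentence of its own here.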
At this point, any and all help is appreciated.