Python: how to reconcile a list of tuples containing a tokenized string with the original string

Posted on May 21

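I am trying to align idx_tag_token with the original string word_string. idx_tag_token is a list of tuples containing the tokenized string together with each token's tag and character index. I want to output a list of tuples in which each tuple contains one element of the original string (as obtained by splitting on whitespace) along with the tag information from idx_tag_token.

I have written some code that uses the character index to find the word in word_string associated with each token. I then build a list of tuples containing each word and its associated tag; this is defined as word_tag_list. From there, however, I am not sure how to proceed to create the desired output.

The conditions for updating the tags are not complicated, but I cannot come up with a suitable system here.

Any help would be much appreciated.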

Data:

word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"

idx_tag_token =[(0, 'O', 'At'),
                (3, 'GPE-B', 'London'),
                (9, 'O', ','),
                (11, 'DATE-B', 'the'),
                (15, 'DATE-I', '12th'),
                (20, 'O', 'in'),
                (23, 'DATE-B', 'February'),
                (31, 'DATE-I', ','),
                (33, 'DATE-I', '1942'),
                (37, 'O', ','),
                (39, 'O', 'and'),
                (43, 'O', 'for'),
                (47, 'O', 'that'),
                (52, 'O', 'that'),
                (57, 'O', 'reason'),
                (64, 'PERSON-B', 'Mark'),
                (68, 'O', "'s"),
                (71, 'O', '('),
                (72, 'O', '3'),
                (73, 'O', ')'),
                (75, 'O', 'wins'),
                (79, 'O', ','),
                (81, 'NORP-B', 'American'),
                (90, 'O', 'parts')]

My code:

def find_word_from_index(idx, word_string):
    words = word_string.split()
    current_index = 0

    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        current_index = end_index + 2
    return None


word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list

Current output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('London,', 'O'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('February,', 'DATE-I'),
 ('1942,', 'DATE-I'),
 ('1942,', 'O'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ("Mark's", 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Desired output:

[('At', 'O'),
('London,', 'GPE-B'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('1942,', 'DATE-I'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
('(3)', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]

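One approach is a generator that keeps concatenating consecutive tokens until they match the next whitespace-delimited word, then yields that word with the tag of its first token:

def get_tokens(tokens):
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield  # wait for the first word to be sent in
    while True:
        if next_token == word:
            # The accumulated tokens match the word: emit it with the tag
            # of its first token, then start on the next token.
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            # Not a full word yet: append the following token's text.
            _, _, tmp = next(it)
            next_token += tmp


it = get_tokens(idx_tag_token)
next(it)  # prime the generator
out = [it.send(w) for w in word_string.split()]
print(out)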

This prints:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('1942,', 'DATE-I'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]
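
As an alternative sketch (not part of the answer above), the character offsets already stored in idx_tag_token can be used directly instead of matching token text. The helper name align_tags is hypothetical, and the code assumes that the tokens are sorted by start offset and that those offsets refer to positions in word_string. Each word takes the tag of the first token that starts inside it, which is the rule the desired output appears to follow.

```python
import re


def align_tags(word_string, idx_tag_token):
    """Pair each whitespace-delimited word with the tag of its first token.

    Assumes idx_tag_token is sorted by character offset (hypothetical helper).
    """
    result = []
    pos = 0  # index of the next unconsumed token
    # Take word spans from the string itself, so repeated or irregular
    # whitespace does not shift the offsets.
    for match in re.finditer(r"\S+", word_string):
        start, end = match.start(), match.end()
        tag = None
        # Consume every token whose start offset lies before the word's end.
        while pos < len(idx_tag_token) and idx_tag_token[pos][0] < end:
            idx, tok_tag, _ = idx_tag_token[pos]
            if tag is None and idx >= start:
                tag = tok_tag  # the first token of the word decides the tag
            pos += 1
        result.append((match.group(), tag))
    return result


# Small sample taken from the start of the data in the question.
sample_string = "At London, the 12th"
sample_tokens = [
    (0, "O", "At"),
    (3, "GPE-B", "London"),
    (9, "O", ","),
    (11, "DATE-B", "the"),
    (15, "DATE-I", "12th"),
]
print(align_tags(sample_string, sample_tokens))
# [('At', 'O'), ('London,', 'GPE-B'), ('the', 'DATE-B'), ('12th', 'DATE-I')]
```

Because this version never compares token text with word text, it still returns a result when a tokenizer normalizes characters; a word with no token starting inside it gets None as its tag.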

