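I am trying to align idx_tag_token with the original string word_string. idx_tag_token is a list of tuples, each holding a character index, a tag, and a token from the tokenized string. I want to output a list of tuples in which each tuple contains one element of the original string (as produced by splitting on whitespace) together with the tag information from idx_tag_token.

I have written some code that uses the character index to find the word in word_string that each token belongs to. I then build a list of tuples pairing each word with the associated tag; this is word_tag_list. From there, however, I am not sure how to proceed to produce the desired output.

The conditions for updating the tags are not complicated, but I can't come up with a suitable scheme here.

Any help would be appreciated.

Data: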

word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"

idx_tag_token =[(0, 'O', 'At'),
                (3, 'GPE-B', 'London'),
                (9, 'O', ','),
                (11, 'DATE-B', 'the'),
                (15, 'DATE-I', '12th'),
                (20, 'O', 'in'),
                (23, 'DATE-B', 'February'),
                (31, 'DATE-I', ','),
                (33, 'DATE-I', '1942'),
                (37, 'O', ','),
                (39, 'O', 'and'),
                (43, 'O', 'for'),
                (47, 'O', 'that'),
                (52, 'O', 'that'),
                (57, 'O', 'reason'),
                (64, 'PERSON-B', 'Mark'),
                (68, 'O', "'s"),
                (71, 'O', '('),
                (72, 'O', '3'),
                (73, 'O', ')'),
                (75, 'O', 'wins'),
                (79, 'O', ','),
                (81, 'NORP-B', 'American'),
                (90, 'O', 'parts')]

My code is:

def find_word_from_index(idx, word_string):
    words = word_string.split()
    current_index = 0

    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        current_index = end_index + 2
    return None


word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list

Current output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('London,', 'O'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('February,', 'DATE-I'),
 ('1942,', 'DATE-I'),
 ('1942,', 'O'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ("Mark's", 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Desired output:

[('At', 'O'),
('London,', 'GPE-B'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('1942,', 'DATE-I'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
('(3)', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]

Recommended answer

Try:

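def get_tokens(tokens):
    # Coroutine-style generator: receives one word per send() and yields (word, tag).
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield  # priming yield; the first send() delivers the first word
    while True:
        if next_token == word:
            # The accumulated token text equals the word: emit it with the tag
            # of the word's first token, then move on to the next token.
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            # The word spans several tokens: append the next token's text
            # and keep the tag of the first token.
            _, _, tmp = next(it)
            next_token += tmp

it = get_tokens(idx_tag_token)
next(it)  # advance the generator to the priming yield
out = [it.send(w) for w in word_string.split()]

print(out)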

Prints:

[
    ("At", "O"),
    ("London,", "GPE-B"),
    ("the", "DATE-B"),
    ("12th", "DATE-I"),
    ("in", "O"),
    ("February,", "DATE-B"),
    ("1942,", "DATE-I"),
    ("and", "O"),
    ("for", "O"),
    ("that", "O"),
    ("that", "O"),
    ("reason", "O"),
    ("Mark's", "PERSON-B"),
    ("(3)", "O"),
    ("wins,", "O"),
    ("American", "NORP-B"),
    ("parts", "O"),
]
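
The approach above matches on token text. Since idx_tag_token also carries character offsets, an alternative is to align on those offsets instead. The following is only a minimal sketch of that idea, not part of the answer above: the helper name align_tags_by_offset is hypothetical, and it assumes the tokens are sorted by offset and that each word should take the tag of the first token starting inside it (which is what the desired output suggests).

```python
import re


def align_tags_by_offset(text, tokens):
    """Pair each whitespace-separated word in `text` with a tag.

    `tokens` is a list of (char_index, tag, token_text) tuples, assumed
    to be sorted by char_index. Each word gets the tag of the first token
    that starts inside the word's character span (None if there is none).
    """
    out = []
    pending = iter(tokens)
    tok = next(pending, None)
    for m in re.finditer(r"\S+", text):
        tag = None
        # Consume every token that starts before the end of this word.
        while tok is not None and tok[0] < m.end():
            if tag is None and tok[0] >= m.start():
                tag = tok[1]  # keep only the first token's tag
            tok = next(pending, None)
        out.append((m.group(), tag))
    return out


# Small demo using the first few tokens from the question's data.
demo_text = "At London, the 12th"
demo_tokens = [
    (0, "O", "At"),
    (3, "GPE-B", "London"),
    (9, "O", ","),
    (11, "DATE-B", "the"),
    (15, "DATE-I", "12th"),
]
demo_out = align_tags_by_offset(demo_text, demo_tokens)
print(demo_out)
```

With the question's data this would be called as align_tags_by_offset(word_string, idx_tag_token). Because it never compares token text, it should tolerate tokenizers that normalize characters, provided the offsets still refer to the original string.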
