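I am trying to align idx_tag_token with the original string word_string. idx_tag_token is a list of tuples holding the tokenized string together with each token's tag and character index. I want to output a list of tuples in which each tuple contains one element of the original string (as obtained by splitting on whitespace) and the tag information from idx_tag_token.
I have written some code that uses the character index to find the word in word_string that a token belongs to. I then build a list of tuples pairing each word with its associated tag; this is word_tag_list. However, I am not sure how to get from there to the desired output.
The conditions for updating the tags are not complicated, but I can't come up with a suitable system here.
Any help would be much appreciated.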
Data:
word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"
idx_tag_token = [(0, 'O', 'At'),
(3, 'GPE-B', 'London'),
(9, 'O', ','),
(11, 'DATE-B', 'the'),
(15, 'DATE-I', '12th'),
(20, 'O', 'in'),
(23, 'DATE-B', 'February'),
(31, 'DATE-I', ','),
(33, 'DATE-I', '1942'),
(37, 'O', ','),
(39, 'O', 'and'),
(43, 'O', 'for'),
(47, 'O', 'that'),
(52, 'O', 'that'),
(57, 'O', 'reason'),
(64, 'PERSON-B', 'Mark'),
(68, 'O', "'s"),
(71, 'O', '('),
(72, 'O', '3'),
(73, 'O', ')'),
(75, 'O', 'wins'),
(79, 'O', ','),
(81, 'NORP-B', 'American'),
(90, 'O', 'parts')]
My code is:
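def find_word_from_index(idx, word_string):
    # Walk the whitespace-separated words, tracking each word's character span.
    # Note: this assumes words are separated by exactly one space.
    words = word_string.split()
    current_index = 0
    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        current_index = end_index + 2  # skip the single space after the word
    return None
# Look up the containing word for every token and pair it with the token's tag.
word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list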
Current output:
[('At', 'O'),
('London,', 'GPE-B'),
('London,', 'O'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('February,', 'DATE-I'),
('1942,', 'DATE-I'),
('1942,', 'O'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
("Mark's", 'O'),
('(3)', 'O'),
('(3)', 'O'),
('(3)', 'O'),
('wins,', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]
Desired output:
[('At', 'O'),
('London,', 'GPE-B'),
('the', 'DATE-B'),
('12th', 'DATE-I'),
('in', 'O'),
('February,', 'DATE-B'),
('1942,', 'DATE-I'),
('and', 'O'),
('for', 'O'),
('that', 'O'),
('that', 'O'),
('reason', 'O'),
("Mark's", 'PERSON-B'),
('(3)', 'O'),
('wins,', 'O'),
('American', 'NORP-B'),
('parts', 'O')]
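One direction that might work, offered as a minimal sketch rather than a definitive solution: instead of looking up a word for every token, iterate over the words and collect the tags of all tokens whose character index falls inside each word's span. The function name align_tags_to_words is hypothetical. The tag-selection rule (take the first non-'O' tag in a word, otherwise 'O') is an assumption inferred from the desired output above, and the tokens are assumed to be sorted by character index.

```python
import re


def align_tags_to_words(word_string, idx_tag_token):
    """Return one (word, tag) pair per whitespace-separated word."""
    # Character span (start, end) of every whitespace-delimited word, in order.
    spans = [(m.start(), m.end(), m.group())
             for m in re.finditer(r"\S+", word_string)]
    result = []
    pos = 0  # position in idx_tag_token; tokens are assumed sorted by index
    for start, end, word in spans:
        tags = []
        # Consume every token that starts before this word ends.
        while pos < len(idx_tag_token) and idx_tag_token[pos][0] < end:
            tags.append(idx_tag_token[pos][1])
            pos += 1
        # Assumed rule: the first non-'O' tag wins, otherwise the word is 'O'.
        tag = next((t for t in tags if t != "O"), "O")
        result.append((word, tag))
    return result


# Small illustrative input (not the full data from the question).
sample = "At London, in 1942,"
tokens = [(0, "O", "At"), (3, "GPE-B", "London"), (9, "O", ","),
          (11, "O", "in"), (14, "DATE-B", "1942"), (18, "O", ",")]
print(align_tags_to_words(sample, tokens))
# [('At', 'O'), ('London,', 'GPE-B'), ('in', 'O'), ('1942,', 'DATE-B')]
```

Because the word spans come from re.finditer rather than from counting single spaces, this sketch should also cope with runs of several spaces between words. If a different rule for combining tags is needed, only the line that picks tag has to change.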