In a token classification task, I use the transformers tokenizer, which outputs objects of the BatchEncoding class.
Here is the main part of the code:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
def extract_labels(raw_text):
    # split text into words and extract the labels
    (...)
    return clean_words, labels

def tokenize_text(words, labels):
    # tokenize text
    tokens = tokenizer(words, is_split_into_words=True, padding='max_length', truncation=True, max_length=MAX_LENGTH)
    # since words might be split into subwords, labels need to be re-aligned:
    # only the first subword keeps the label (see the sketch after this snippet)
    (...)
    tokens['labels'] = label_ids
    return tokens

tokens = []
for raw_text in data:
    clean_text, l = extract_labels(raw_text)
    t = tokenize_text(clean_text, l)
    tokens.append(t)
type(tokens[0])
# transformers.tokenization_utils_base.BatchEncoding
tokens[0]
# {'input_ids': [101, 69887, 10112, ..., 0, 0, 0], 'attention_mask': [1, 1, 1, ... 0, 0, 0], 'labels': [-100, 0, -100, ..., -100, -100, -100]}
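For context, the elided re-alignment follows the usual token-classification pattern: only the first subword of each word keeps the word's label, and everything else is masked with -100 so the loss ignores it. A rough sketch of that step (assumed, not my exact code):

label_ids = []
previous_word_id = None
for word_id in tokens.word_ids():
    if word_id is None:                # special tokens ([CLS], [SEP]) and padding
        label_ids.append(-100)
    elif word_id != previous_word_id:  # first subword keeps the word's label
        label_ids.append(labels[word_id])
    else:                              # continuation subwords are masked out
        label_ids.append(-100)
    previous_word_id = word_id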
Update: as requested, here is a minimal example to reproduce:
from transformers import BertTokenizerFast
import tensorflow as tf
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
tokens = []
for text in ["Hello there", "Good morning"]:
    t = tokenizer(text.split(), is_split_into_words=True, padding='max_length', truncation=True, max_length=10)
    t['labels'] = list(map(lambda x: 1, t.word_ids()))  # fake labels to simplify the example
    tokens.append(t)
print(type(tokens[0])) # now tokens is a list of BatchEncodings
print(tokens)
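For reference, the encodings can be merged into a single dict of lists by hand before building the dataset; a sketch, assuming every encoding is padded to the same max_length so the lists are rectangular:

# merge the list of BatchEncodings into one dict of (rectangular) lists
merged = {key: [t[key] for t in tokens] for key in tokens[0].keys()}
dataset = tf.data.Dataset.from_tensor_slices(merged)

This works here only because padding='max_length' guarantees equal-length sequences.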
If I tokenize the whole dataset in a single call instead, I get one BatchEncoding comprising everything, but then I cannot handle the labels:
data = ["Hello there", "Good morning"]
tokens = tokenizer(data, padding='max_length', truncation=True, max_length=10)
# now tokens is a single BatchEncoding comprising the whole dataset
print(type(tokens))
print(tokens)
# this way I can build a tf dataset directly:
tf.data.Dataset.from_tensor_slices(tokens)
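Iterating that dataset shows each element is a dict of tensors, but without a 'labels' key, since the batched call never produced one (a quick check; dict() is only there to pass a plain mapping to TensorFlow):

dataset = tf.data.Dataset.from_tensor_slices(dict(tokens))
for example in dataset.take(1):
    print(example)  # only 'input_ids', 'token_type_ids' and 'attention_mask'; no 'labels'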
Note that I need to iterate over the texts first to extract the labels, and that I need each text's word_ids() to re-align the labels.