I am trying to fine-tune T5-Small on the XSum dataset with PyTorch on Windows 10 (CUDA 12.1).

Unfortunately, the Trainer (or Seq2SeqTrainer) class isn't usable on Windows because bitsandbytes is not supported there, so I need to write my own epoch loop:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, get_scheduler
from torch.utils.data import DataLoader
from torch.optim import AdamW
import torch
from tqdm.auto import tqdm

dataset = load_dataset("xsum")
MODEL_NAME = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

prefix = "summarize: "
max_input_length = 1024
max_target_length = 128

def tokenize_function(examples):
    
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['document', 'summary', 'id'])
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.save_pretrained("outputs/trained")

I get this error:

RuntimeError: stack expects each tensor to be equal size, but got [352] at entry 0 and [930] at entry 1

How can I fix this?

Recommended answer

"同等大小"?

In tokenize_function you are truncating the inputs to one maximum length (max_input_length) and the targets to a different maximum length (max_target_length). The common practice when working with text of varying lengths is to pad the sequences in each batch to a consistent length. Try passing the padding argument to the tokenizer, set to True or 'longest', so that the sequences in a batch are padded to the same length:

def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding='longest')

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True, padding='longest')

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
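
One detail worth adding here (my own suggestion, not part of the original answer): with padding='longest' the label sequences now contain pad tokens, and Hugging Face seq2seq models only exclude label positions set to -100 from the loss. A variant of tokenize_function that masks the padded label positions could look like this:

def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding='longest')

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True, padding='longest')

    # Replace pad tokens in the labels with -100 so the loss skips padded positions
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs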

Also, make sure the DataLoader builds its batches correctly. In some cases you may need to define a custom collate function (as in this thread) so that batches are formed properly, especially when working with text of varying lengths.


If the problem persists, the tensor size mismatch is most likely occurring while the batch is being assembled: the default collate function tries to stack tensors of different lengths.

One possible solution is to write a custom collate function that pads all tensors in a batch to the same length before they are handed to the model. Inside the collate function you can pad every sequence in the batch to the length of the longest one, for example with PyTorch's pad_sequence:

from torch.nn.utils.rnn import pad_sequence

def custom_collate_fn(batch):
    inputs = [item['input_ids'] for item in batch]
    labels = [item['labels'] for item in batch]

    # Pad every sequence in the batch to the length of the longest one
    padded_inputs = pad_sequence(inputs, batch_first=True, padding_value=tokenizer.pad_token_id)
    # Pad labels with -100 so the loss ignores the padded positions
    padded_labels = pad_sequence(labels, batch_first=True, padding_value=-100)

    # Create a new batch with padded sequences and a matching attention mask
    new_batch = {
        'input_ids': padded_inputs,
        'labels': padded_labels,
        'attention_mask': padded_inputs.ne(tokenizer.pad_token_id).long()
    }
    return new_batch

# rest of your code

# Use the custom collate function in your DataLoaders
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8, collate_fn=custom_collate_fn)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8, collate_fn=custom_collate_fn)

# rest of your code

In this custom collate function custom_collate_fn, PyTorch's pad_sequence is used to pad the input_ids and labels tensors to the length of the longest sequence in the batch (labels are padded with -100 so that the loss ignores the padded positions), and attention_mask is rebuilt so that it marks where the real tokens are and where the padding is. The collate function is then passed to the collate_fn argument of both DataLoader instances: this ensures that all tensors in a batch are padded to the same length before they are fed to the model.
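
If you would rather not maintain a collate function by hand, the transformers library also provides DataCollatorForSeq2Seq, which does the same dynamic per-batch padding (and pads labels with -100 by default). A minimal sketch, assuming the tokenizer, model, and the small tokenized datasets defined earlier in your script:

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

# Pads input_ids / attention_mask with the pad token and labels with -100,
# each batch to the length of its longest sequence
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8, collate_fn=data_collator)

Depending on your transformers version it can be easier to skip set_format("torch") and let the collator convert the token lists to PyTorch tensors itself.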
