I am still new to neural networks and machine learning, and I am having trouble understanding a problem I ran into in PyTorch and how to fix it.
My dataset, once stored in inputs
and outputs
, is a 100x125x6 array representing 6 time-dependent variables over 125 time steps, with 100 independent sets. I am trying to model this with the code below, but I get an error when computing the gradients in the backward pass. I have seen answers that involve detach()
'ing, or inserting model_opt.zero_grad()
after model_opt.step()
; however, I do not understand what is going on well enough to know whether these are the right solutions (or how to make them work), and I am looking for further clarification and help.
Just to clarify what my code is meant to do: inside train()
, I manually group the 100 independent sets into batches of batch_size. From each batch, I get the loss, add it to the losses of all batches so far in that epoch, and then compute the average loss for the epoch so far. I then use this average loss to update the optimizer.
Here is a minimal reproducible example:
from pathlib import Path
import numpy as np
import h5py
import torch
from torch import nn
from torch import optim
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, num_layers, batch_size=1):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.rnn = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x, device):
        hidden = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, dtype=torch.float64).to(device)
        cell_state = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, dtype=torch.float64).to(device)
        output, (hidden, cell_state) = self.rnn(x, (hidden, cell_state))
        output = self.fc(output)
        return output, hidden, cell_state

def train(epochs, rnn_model, model_loss, model_opt, inputs, outputs, batch_size, device):
    for epoch in range(epochs):
        rnn_model.train()
        model_opt.zero_grad()
        total_loss = 0.0
        num_batches = np.ceil(inputs.shape[0]/batch_size)
        for batch_i in range(int(num_batches)):
            start = batch_i*batch_size
            if batch_i == num_batches - 1:
                end = inputs.shape[0]
            else:
                end = (batch_i+1)*batch_size
            inp = inputs[start:end, :, :]
            target = outputs[start:end, :, :]
            out, hidden, cell_state = rnn_model(inp, device)
            total_loss += model_loss(out, target)
            loss = total_loss/end
            loss.backward()
            model_opt.step()
    return

def main(fname, input_size, hidden_size, output_size, num_layers, batch_size, num_epochs, learning_rate):
    data_dir = Path(r'path\to\my\data')
    # load data
    train_file = data_dir / f'NN_{fname}.h5'
    f = h5py.File(train_file, 'r')
    inputs = np.swapaxes(np.array(f['series']['input']), 0, 2)
    outputs = np.swapaxes(np.array(f['series']['output']), 0, 2)
    # Define model, optimizer, and loss
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = RNN(input_size, hidden_size, output_size, num_layers, batch_size=batch_size).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_func = nn.MSELoss()
    # send data to computation device
    inputs = torch.from_numpy(inputs).to(device)
    outputs = torch.from_numpy(outputs).to(device)
    # pre-training
    i_temp = inputs[:, range(25), :]
    o_temp = outputs[:, range(25), :]
    train(int(num_epochs*0.01), model, loss_func, optimizer, i_temp, o_temp, batch_size, device)
    return

if __name__ == '__main__':
    torch.set_default_dtype(torch.float64)
    input_size = 6
    hidden_size = 7
    output_size = 6
    num_epochs = 2500
    batch_size = 100
    learning_rate = 0.0001
    num_layers = 3
    f_name = 'data'
    main(f_name, input_size, hidden_size, output_size, num_layers, batch_size, num_epochs, learning_rate)
The following error is raised by loss.backward()
inside train()
:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Edit: I have added total_loss = total_loss.detach()
immediately after model_opt.step()
, and it now runs without errors. However, I would still like to know whether this is correct, given the intent I described above.
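Concretely, the end of the batch loop in train() now looks like this (only the last line was added; everything else is unchanged from the code above):

            out, hidden, cell_state = rnn_model(inp, device)
            total_loss += model_loss(out, target)
            loss = total_loss/end
            loss.backward()
            model_opt.step()
            total_loss = total_loss.detach()  # added: detach the running total from the graph that was just backpropagated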