I am still new to neural networks and machine learning, and I am having trouble understanding a problem I ran into in PyTorch and how to fix it.
My dataset, once stored in inputs
and outputs
, is a 100x125x6 array representing 6 time-dependent variables over 125 time steps, with 100 independent sets. I am trying to model this with the code below, but I get an error when computing the gradients in the backward pass. I have seen answers that involve detach()
'ing, or inserting model_opt.zero_grad()
after model_opt.step()
; however, I do not understand what is going on well enough to know whether these are the right solutions (or how to make them work), and I am looking for further clarification and help.
Just to clarify what my code is meant to do: inside train()
, I manually group the 100 independent sets into batches of batch_size. From each batch, I get the loss, add it to the losses of all batches so far in that epoch, and then compute the average loss for the epoch so far. I then use this average loss to update the optimizer.
Here is a minimal reproducible example:
from pathlib import Path
import numpy as np
import h5py
import torch
from torch import nn
from torch import optim
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, num_layers, batch_size=1):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.rnn = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x, device):
        hidden = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, dtype=torch.float64).to(device)
        cell_state = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, dtype=torch.float64).to(device)
        output, (hidden, cell_state) = self.rnn(x, (hidden, cell_state))
        output = self.fc(output)
        return output, hidden, cell_state

def train(epochs, rnn_model, model_loss, model_opt, inputs, outputs, batch_size, device):
    for epoch in range(epochs):
        rnn_model.train()
        model_opt.zero_grad()
        total_loss = 0.0
        num_batches = np.ceil(inputs.shape[0]/batch_size)
        for batch_i in range(int(num_batches)):
            start = batch_i*batch_size
            if batch_i == num_batches - 1:
                end = inputs.shape[0]
            else:
                end = (batch_i+1)*batch_size
            inp = inputs[start:end, :, :]
            target = outputs[start:end, :, :]
            out, hidden, cell_state = rnn_model(inp, device)
            total_loss += model_loss(out, target)
            loss = total_loss/end
            loss.backward()
            model_opt.step()
    return

def main(fname, input_size, hidden_size, output_size, num_layers, batch_size, num_epochs, learning_rate):
    data_dir = Path(r'path\to\my\data')
    # load data
    train_file = data_dir / f'NN_{fname}.h5'
    f = h5py.File(train_file, 'r')
    inputs = np.swapaxes(np.array(f['series']['input']), 0, 2)
    outputs = np.swapaxes(np.array(f['series']['output']), 0, 2)
    # Define model, optimizer, and loss
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = RNN(input_size, hidden_size, output_size, num_layers, batch_size=batch_size).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_func = nn.MSELoss()
    # send data to computation device
    inputs = torch.from_numpy(inputs).to(device)
    outputs = torch.from_numpy(outputs).to(device)
    # pre-training
    i_temp = inputs[:, range(25), :]
    o_temp = outputs[:, range(25), :]
    train(int(num_epochs*0.01), model, loss_func, optimizer, i_temp, o_temp, batch_size, device)
    return

if __name__ == '__main__':
    torch.set_default_dtype(torch.float64)
    input_size = 6
    hidden_size = 7
    output_size = 6
    num_epochs = 2500
    batch_size = 100
    learning_rate = 0.0001
    num_layers = 3
    f_name = 'data'
    main(f_name, input_size, hidden_size, output_size, num_layers, batch_size, num_epochs, learning_rate)
The following error is raised by loss.backward()
inside train()
:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Edit: I have added total_loss = total_loss.detach()
immediately after model_opt.step()
, and it now runs without errors. However, I would still like to know whether this is correct, given the intent I described above.
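Concretely, the end of the batch loop in train() now looks like this (only the last line was added; everything else is unchanged from the code above):

            out, hidden, cell_state = rnn_model(inp, device)
            total_loss += model_loss(out, target)
            loss = total_loss/end
            loss.backward()
            model_opt.step()
            total_loss = total_loss.detach()  # added: detach the running total from the graph that was just backpropagated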