Issue
I am trying to build an autoencoder model where the input/output is RGB images of size 256 x 256. I tried to train the model on 1 GPU with 12 GB of memory, but I always get CUDA OOM (I tried different batch sizes, and even a batch size of 1 fails). So I read about model parallelism in PyTorch and tried this:
class Autoencoder(nn.Module):
    def __init__(self, input_output_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_output_size, 1024),
            nn.ReLU(True),
            nn.Linear(1024, 200),
            nn.ReLU(True)
        ).cuda(0)

        self.decoder = nn.Sequential(
            nn.Linear(200, 1024),
            nn.ReLU(True),
            nn.Linear(1024, input_output_size),
            nn.Sigmoid()
        ).cuda(1)

        print(self.encoder.get_device())
        print(self.decoder.get_device())

    def forward(self, x):
        x = x.cuda(0)
        x = self.encoder(x)
        x = x.cuda(1)
        x = self.decoder(x)
        return x
So I have moved my encoder and decoder onto different GPUs. But now I get this exception:
Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 0 does not equal 1 (while checking arguments for addmm)
It appears when I do x = x.cuda(1) in the forward method.

Moreover, here is my "train" code; maybe you can give me some advice about optimizations? Are images of 3 x 256 x 256 too large for training? (I cannot reduce them.) Thank you in advance.
Training:
input_output_size = 3 * 256 * 256

model = Autoencoder(input_output_size).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(100):
    epoch_loss = 0
    for batch_idx, (images, _) in enumerate(dataloader):
        images = torch.flatten(images, start_dim=1).to(device)
        output_images = model(images).to(device)
        train_loss = criterion(output_images, images)
        train_loss.backward()
        optimizer.step()

        if batch_idx % 5 == 0:
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
            test_loss = criterion(pred, test_set)
            wandb.log({"MSE train": train_loss})
            wandb.log({"MSE test": test_loss})
            del pred, test_loss

        if batch_idx % 200 == 0:
            # here I send testing images from output to W&B
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
            wandb.log({"PRED": [wandb.Image((pred[i].cpu().reshape((3, 256, 256)).permute(1, 2, 0) * 255).numpy().astype(np.uint8), caption=str(i)) for i in range(20)]})
            del pred

        gc.collect()
        torch.cuda.empty_cache()
        epoch_loss += train_loss.item()

        del output_images, train_loss

    epoch_loss = epoch_loss / len(dataloader)
    wandb.log({"Epoch MSE train": epoch_loss})
    del epoch_loss
Solution
Three issues that I'm seeing:

1. model(test_set)

This sends the entirety of your test set (presumably huge) through your model as a single batch.
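One way around that, sketched under the assumption that test_set is a tensor of flattened test images and that model, criterion and device are the ones defined above (the batch size of 16 is arbitrary), is to evaluate the test set in small chunks:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Evaluate the test set in small batches instead of one huge forward pass.
    test_loader = DataLoader(TensorDataset(test_set), batch_size=16)

    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for (batch,) in test_loader:
            batch = batch.to(device)
            pred = model(batch)
            # Accumulate plain Python floats, not graph-carrying tensors.
            test_loss += criterion(pred, batch).item() * batch.size(0)
    test_loss /= len(test_set)
    model.train()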
2. I don't know what wandb is, but another likely source of memory growth is these lines:

    wandb.log({"MSE train": train_loss})
    wandb.log({"MSE test": test_loss})

You seem to be saving train_loss and test_loss, but these contain not only the numbers themselves, but also the computational graphs (living on the GPU) needed for backprop. Before saving them, convert them into float or numpy.
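A minimal sketch of what that change to the two logging lines could look like:

    # Log plain numbers so no autograd graph (or GPU tensor) is kept alive.
    wandb.log({"MSE train": train_loss.item()})
    wandb.log({"MSE test": test_loss.item()})

    # Equivalently, detach first if you prefer:
    # wandb.log({"MSE train": float(train_loss.detach().cpu())})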
3. Your model contains two 3*256*256 x 1024 weight blocks. When used with Adam, each will require 3*256*256 * 1024 * 3 * 4 bytes = 2.25 GB of VRAM (possibly more, if it's inefficiently implemented). This also looks like a poor architecture for other reasons.
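For reference, the arithmetic behind that estimate (the factor of 3 covers the weight plus Adam's two moment buffers, at 4 bytes per float32 value):

    params_per_block = 3 * 256 * 256 * 1024     # one 196608 x 1024 Linear weight
    bytes_per_block = params_per_block * 3 * 4  # weight + exp_avg + exp_avg_sq, float32
    print(bytes_per_block / 1024**3)            # ~2.25 GB per block, before activations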
Answered By - MWB