Issue
I trained an ALBEF-based model in Python and decided to run inference in C++ for efficiency. I used torch.jit.trace in Python to save the model and loaded the resulting .pt file in C++. However, I hit the error in the title when running inference.
Here is my C++ code:
if (torch::cuda::is_available()) {
    // Load the traced module directly onto the GPU.
    n_model = torch::jit::load("/home/lzh/Storage4/lzh/deepmodel/model_scripted.pt", torch::kCUDA);
    std::cout << torch::cuda::device_count() << std::endl;
} else {
    std::cerr << "No CUDA devices available, cannot move model to GPU." << std::endl;
}
torch::Tensor inputs = torch::from_blob(fre, {1, 4, 300, 201}, torch::kFloat).to(torch::kCUDA);
std::cout << inputs.device() << std::endl;
// Tensor::to returns a new tensor rather than moving in place, so keep the result.
textInput.input_ids = textInput.input_ids.to(torch::kCUDA);
textInput.attention_mask = textInput.attention_mask.to(torch::kCUDA);
torch::Tensor out_tensor = n_model.forward({inputs, textInput.input_ids, textInput.attention_mask}).toTensor();
Running it produces this error:
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/models/model_somatic.py", line 14, in forward
cls_head = self.cls_head
ALBEF = self.ALBEF
_0 = (ALBEF).forward(image, input_ids, attention_mask, )
~~~~~~~~~~~~~~ <--- HERE
return (cls_head).forward(_0, )
class ALBEF(Module):
File "code/__torch__/models/model_somatic.py", line 35, in forward
_5 = torch.ones([_3, int(_4)], dtype=4, layout=None, device=torch.device("cpu"), pin_memory=False)
encoder_attention_mask = torch.to(_5, dtype=4, layout=0, device=torch.device("cpu"))
_6 = (text_encoder).forward(input_ids, attention_mask, _1, encoder_attention_mask, )
~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_7 = torch.slice(_6, 0, 0, 9223372036854775807)
input = torch.slice(torch.select(_7, 1, 0), 1, 0, 9223372036854775807)
File "code/__torch__/models/xbert.py", line 19, in forward
cls = self.cls
bert0 = self.bert
_0 = (bert0).forward(input_ids, attention_mask, argument_3, encoder_attention_mask, )
~~~~~~~~~~~~~~ <--- HERE
_1 = (cls).forward(weight, _0, )
return _0
File "code/__torch__/models/xbert.py", line 50, in forward
_8 = torch.to(encoder_extended_attention_mask, 6)
attention_mask1 = torch.mul(torch.rsub(_8, 1.), CONSTANTS.c3)
_9 = (embeddings).forward(input_ids, input, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_10 = (encoder).forward(_9, attention_mask0, argument_3, attention_mask1, )
return _10
File "code/__torch__/models/xbert.py", line 78, in forward
input0 = torch.slice(_12, 1, 0, _11)
_13 = (word_embeddings).forward(input_ids, )
_14 = (token_type_embeddings).forward(input, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
embeddings = torch.add(_13, _14)
_15 = (position_embeddings).forward(input0, )
File "code/__torch__/torch/nn/modules/sparse/___torch_mangle_164.py", line 10, in forward
input: Tensor) -> Tensor:
weight = self.weight
return torch.embedding(weight, input)
~~~~~~~~~~~~~~~ <--- HERE
Traceback of TorchScript, original code (most recent call last):
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/functional.py(2044): embedding
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/sparse.py(158): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(207): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(1046): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(1400): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/model_somatic.py(47): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/model_somatic.py(90): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/jit/_trace.py(958): trace_module
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/jit/_trace.py(741): trace
/home/lzh/ALBEF/checkpoint.py(46): main
/home/lzh/ALBEF/checkpoint.py(76): <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
Strangely enough, I get the same error when I load the same file in Python:
import torch

# Trace with CPU example inputs and save the result.
image = torch.rand(16, 4, 300, 201)
text1 = torch.rand(16, 25).long()
text2 = torch.rand(16, 25).long()
traced_script_module = torch.jit.trace(model, (image, text1, text2))
traced_script_module.save('model_scripted.pt')

# Load the saved module onto the GPU and run it with GPU inputs.
device = torch.device("cuda:0")
text = torch.ones((1, 25))
text = text.long().to(device)
image = torch.ones((1, 4, 300, 201)).to(device)
model = torch.jit.load('model_scripted.pt', map_location=torch.device('cuda'))
model.eval()
# Confirm that the parameters were mapped to the GPU.
for param in model.parameters():
    if param.device.type == 'cuda':
        print('cuda')
print(image.device)
print(text.device)
out = model(image, text, text)
The parameter check prints cuda, and the inputs report cuda:0. The error is the same as in C++. I used the method described in the linked post to load the model onto the GPU, but it still does not work. What should I do? This has been bothering me for a long time.
Solution
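I solved the problem in two steps. First, I checked that the model code does not hard-code a device when creating tensors. Then I moved the model and the example inputs to CUDA before tracing and saving. torch.jit.trace records the devices it sees during tracing as constants, which is why the serialized code in the traceback contains device=torch.device("cpu") even though the model was loaded onto the GPU.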
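device = torch.device("cuda:0")
# Move the model and the example inputs to the GPU before tracing.
model.to(device)
image = torch.rand(1, 4, 300, 201).to(device)
text1 = torch.rand(1, 25).long().to(device)
text2 = torch.rand(1, 25).long().to(device)
traced_script_module = torch.jit.trace(model, (image, text1, text2))
traced_script_module.save('model_scripted.pt')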
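The check for baked-in devices can be partly automated by scanning the serialized TorchScript code for device literals. The following is only a rough sketch using the standard library: the helper name find_hardcoded_devices and the regular expression are my own assumptions, not part of PyTorch, and the sample text is two lines copied from the traceback in the question.

```python
import re

# Matches device literals such as torch.device("cpu") or torch.device("cuda:0")
# as they appear in serialized TorchScript code (hypothetical pattern).
DEVICE_LITERAL = re.compile(r'torch\.device\("([^"]+)"\)')


def find_hardcoded_devices(code):
    """Return (line_number, device, stripped_line) for each device literal in code."""
    hits = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for match in DEVICE_LITERAL.finditer(line):
            hits.append((lineno, match.group(1), line.strip()))
    return hits


# Two lines copied from the serialized code shown in the traceback.
sample = '''_5 = torch.ones([_3, int(_4)], dtype=4, layout=None, device=torch.device("cpu"), pin_memory=False)
encoder_attention_mask = torch.to(_5, dtype=4, layout=0, device=torch.device("cpu"))'''

for lineno, device, line in find_hardcoded_devices(sample):
    print(f"line {lineno}: device {device!r} in: {line}")
```

On a real module the text to scan could come from the code attribute of the traced module. Judging by the traceback, submodules keep their own forward code, so I assume each one would have to be inspected separately. Any hit for a device other than the one you plan to run on suggests the model should be re-traced on the target device, or that the Python model should create its tensors on the device of its inputs.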
Answered By - LZH