Pitfalls I hit while reproducing kgTransformer


I am 烂漫寒风, a blogger on 靠谱客. This post records the pitfalls I hit while reproducing kgTransformer, shared here in the hope that it serves as a useful reference.

On the server whose IP ends in 90 there are three CUDA versions installed: 11.1, 11.3 and 11.6. The system-wide default is 11.6.

1. Official code:

https://github.com/THUDM/kgTransformer

Official installation advice:

2. The installation journey:

My default environment had PyTorch 1.9.0, and the build failed with:

nccl.h: No such file or directory

I reinstalled PyTorch 1.8 and still got the same error.

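On closer inspection, nvcc --version showed that my local environment was using CUDA 10.1 rather than any of the system-wide versions, so the new toolkit paths had to be added to bashrc, as shown in the figure. Note that the entries are searched from left to right, so 11.3 becomes the default.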
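The bashrc screenshot is not reproduced here, so the following is only a minimal sketch of the left-to-right rule. It assumes the toolkit lives under /usr/local/cuda-11.3 (a conventional location, not confirmed by the post), and with_cuda_first is a hypothetical helper name.

```python
import os


def _prepend(entry, current):
    """Put entry first in a pathsep-separated list, dropping any later duplicate."""
    rest = [p for p in current.split(os.pathsep) if p and p != entry]
    return os.pathsep.join([entry] + rest)


def with_cuda_first(env, cuda_home):
    """Return a copy of env in which cuda_home's bin/ and lib64/ are searched first."""
    new_env = dict(env)
    new_env["CUDA_HOME"] = cuda_home
    # Search paths are scanned left to right, so the wanted toolkit goes in front.
    new_env["PATH"] = _prepend(os.path.join(cuda_home, "bin"), env.get("PATH", ""))
    new_env["LD_LIBRARY_PATH"] = _prepend(
        os.path.join(cuda_home, "lib64"), env.get("LD_LIBRARY_PATH", "")
    )
    return new_env


# Assumed install location; adjust to wherever CUDA 11.3 actually lives.
example = with_cuda_first(
    {"PATH": os.pathsep.join(["/usr/local/cuda-10.1/bin", "/usr/bin"])},
    "/usr/local/cuda-11.3",
)
print(example["PATH"])
```

In bashrc the same idea would be an export line that puts the 11.3 bin directory in front of $PATH; re-running nvcc --version in a new shell confirms which toolkit wins.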

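At this point the toolkit matches the 11.3 version the official instructions require. Next, to keep the environment clean, I pulled a fresh copy of the source from the repo: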

conda create -n kgTransformer python=3.7

(After activating the virtual environment:)

pip install torch==1.9.0

cd pytorch-geometric && pip install -e .

# The officially suggested fix:

conda install cudatoolkit-dev=11.3 "gxx_linux-64<=10" nccl -c conda-forge -y

fastmoe installed successfully.

3. Runtime error

RuntimeError: Detected that PyTorch and torch_sparse were compiled with different CUDA versions. PyTorch has CUDA version 10.2 and torch_sparse has CUDA version 11.3. Please reinstall the torch_sparse that matches your PyTorch install.

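The torch versions the code requires (1.7 to 1.9) do not support CUDA 11.3, which is also why the libraries had to be force-installed for nvcc in the earlier step. I then downgraded torch-sparse, but got the same result.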

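So I simply started over with everything on the cu111 builds. To avoid pulling dependencies from what conda had already installed, I downloaded everything fresh each time.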
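Before reinstalling, it can help to check which CUDA version the installed torch wheel was actually built with. The sketch below is an assumption-laden helper, not part of kgTransformer: the function names are hypothetical, and treating "same major.minor" as a match is my own simplification of the check that raised the error above.

```python
import importlib


def major_minor(version):
    """Turn '11.3' or '11.3.109' into (11, 3)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(torch_cuda, extension_cuda):
    """Assumed rule: the two builds match when major and minor agree."""
    return major_minor(torch_cuda) == major_minor(extension_cuda)


def installed_torch_cuda():
    """Return (torch version, CUDA version it was built with), or None if torch cannot be imported."""
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # OSError covers a broken install with missing shared libraries.
        return None
    # torch.version.cuda is None for CPU-only builds.
    return torch.__version__, torch.version.cuda
```

Calling installed_torch_cuda() in the failing environment shows whether pip silently pulled a wheel built for a different CUDA version, and cuda_versions_match("10.2", "11.3") reproduces the mismatch reported in the error message.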

4. A fresh start

The version mismatch errors were finally gone, but a new error appeared:

SyntaxError: invalid syntax

This is because the syntax in question is only supported in Python 3.8 and later. After switching to 3.8, I installed:

pip install torch==1.9.0 torch-scatter==2.0.9 torch-sparse==0.6.12 -f https://pytorch-geometric.com/whl/torch-1.9.0%2Bcu111.html

pip install torch-geometric==1.7.2

This failed with an error.

I consulted the CSDN blog post by 汉秋_, "OSError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory (solved)".

And my version was exactly:

5. Starting over, and this time really is the last time

pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/cu111/torch/

pip install torch-scatter==2.0.9 -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html

pip install torch-sparse==0.6.12 -f https://pytorch-geometric.com/whl/torch-1.9.0+cu111.html

pip install torch-geometric==1.7.2

Solved!
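The four commands above share one pattern: the torch version and the CUDA tag appear both in the pinned torch version and in the wheel index URLs. A small sketch of that pattern follows; install_commands and cuda_tag are hypothetical helpers, and the URL templates are copied from the commands above, so whether a page exists for any other version pair is an assumption to verify.

```python
def cuda_tag(cuda_version):
    """Turn '11.1' into the wheel tag 'cu111'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def install_commands(torch_version, cuda_version, scatter, sparse, geometric):
    """Build pip commands that pin torch and its extensions to one CUDA build."""
    tag = cuda_tag(cuda_version)
    local = f"{torch_version}+{tag}"  # e.g. 1.9.0+cu111
    torch_index = f"https://download.pytorch.org/whl/{tag}/torch/"
    pyg_index = f"https://pytorch-geometric.com/whl/torch-{local}.html"
    return [
        f"pip install torch=={local} -f {torch_index}",
        f"pip install torch-scatter=={scatter} -f {pyg_index}",
        f"pip install torch-sparse=={sparse} -f {pyg_index}",
        f"pip install torch-geometric=={geometric}",
    ]


for command in install_commands("1.9.0", "11.1", "2.0.9", "0.6.12", "1.7.2"):
    print(command)
```

The point of pinning torch==1.9.0+cu111 explicitly is that a bare torch==1.9.0 lets pip pick whatever default build it finds, which is one plausible way to end up with mismatched CUDA versions again.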

6. I did not expect to start from scratch yet again

After step 5, I found that fastmoe would no longer install. For the error, see:

So I moved to a new server with a new environment. Its CUDA version is 11.0, and the Python virtual environment I created is 3.8.

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/cu110/torch/

pip install torch-scatter==2.0.7 torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.7.1%2Bcu110.html

pip install -U -i https://pypi.tuna.tsinghua.edu.cn/simple torch_geometric

git submodule update --init

pip install -e .

This is the old server; at run time it reported:

RuntimeError: Shared memory manager connection has timed out at……

This means memory ran out: reduce num_workers or increase the virtual memory size.
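As a rough way to act on that advice, the sketch below checks how much space is left in /dev/shm (where data-loader worker processes typically exchange tensors on Linux) and falls back to loading in the main process when it is low. The 1 GiB threshold and both function names are hypothetical choices for illustration only.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Free bytes in the shared-memory filesystem, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def pick_num_workers(requested, free_bytes, min_free_bytes=1 << 30):
    """Hypothetical heuristic: use 0 workers when shared memory is nearly full."""
    if free_bytes is not None and free_bytes < min_free_bytes:
        return 0  # num_workers=0 loads data in the main process
    return requested


print(pick_num_workers(8, shm_free_bytes()))
```

The returned value would be passed as the num_workers argument of the data loader; inside a container, enlarging shared memory is the other option.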

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 1; 10.92 GiB total capacity; 9.69 GiB already allocated; 27.31 MiB free; 10.01 GiB reserved in total by PyTorch)

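This one means GPU memory ran out. I reduced the batch size to 2 and it still crashed. To rule out other causes, I set os.environ["CUDA_VISIBLE_DEVICES"] = "0" and confirmed that the crash was not caused by a mistake in the parallel code. Fine, I meekly went off to build a Docker image.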
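One detail worth spelling out about the CUDA_VISIBLE_DEVICES check: the variable only takes effect if it is set before CUDA is initialized, so the safest place is before torch is imported at all. A minimal sketch, with pin_gpu as a hypothetical helper name:

```python
import os


def pin_gpu(index):
    """Expose a single physical GPU to this process; call before importing torch."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


pin_gpu(0)
# import torch  # import only after the variable is set
```

After pinning, the one visible card is always addressed as device 0 inside the process, regardless of its physical index, which makes it easy to tell a genuine out-of-memory condition apart from a multi-GPU configuration mistake.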

7. Please, Docker, cut me some slack

sudo docker pull azraelkuan/pytorch1.7.1-hvd-apex-py38-cuda11.0-cudnn8

sudo docker container run -it azraelkuan/pytorch1.7.1-hvd-apex-py38-cuda11.0-cudnn8:latest /bin/bash

root@49555254b199:/mnt# python --version
Python 3.8.5
root@49555254b199:/mnt# pip list

pip install torch-scatter==2.0.7 torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.7.1%2Bcu110.html

pip install -U -i https://pypi.tuna.tsinghua.edu.cn/simple torch_geometric

root@49555254b199:/mnt# git clone https://github.com/THUDM/kgTransformer
Cloning into 'kgTransformer'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 31 (delta 2), reused 3 (delta 0), pack-reused 13
Unpacking objects: 100% (31/31), done.
root@49555254b199:/mnt# ls
kgTransformer
root@49555254b199:/mnt# cd kgTransformer/
root@49555254b199:/mnt/kgTransformer# ls
configs       deter_util.py  graph_util.py  LICENSE  metric.py  README.md   tasks
data_util.py  fastmoe        __init__.py    main.py  model.py   sampler.py  train.py
root@49555254b199:/mnt/kgTransformer# git submodule update --init
Submodule 'fastmoe' (https://github.com/laekov/fastmoe) registered for path 'fastmoe'
Cloning into '/mnt/kgTransformer/fastmoe'...
Submodule path 'fastmoe': checked out 'b652e8d88bbe82171fa28a479f21ad1263fb2e1c'
root@49555254b199:/mnt/kgTransformer# cd fastmoe/
root@49555254b199:/mnt/kgTransformer/fastmoe# pip install -e .
Obtaining file:///mnt/kgTransformer/fastmoe
Installing collected packages: fastmoe
  Running setup.py develop for fastmoe
    ERROR: Command errored out with exit status 1:

So I simply rewrote the Dockerfile in that directory:

FROM azraelkuan/pytorch1.7.1-hvd-apex-py38-cuda11.0-cudnn8
WORKDIR /dockerdir
COPY . .
RUN pip install torch-scatter==2.0.7 torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.7.1%2Bcu110.html \
        && pip install -U -i https://pypi.tuna.tsinghua.edu.cn/simple torch_geometric

Since the code needs a GPU to run, the run command is:

sudo docker run -it --gpus all kgtransformer:1018
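The post does not show how the kgtransformer:1018 image was built, so the build step in this sketch is an assumption (a plain docker build -t run in the Dockerfile's directory). The helpers only assemble the command lines and do not execute anything.

```python
import shlex


def docker_build_argv(tag, context="."):
    """Assumed build step: tag the image built from the Dockerfile in context."""
    return ["sudo", "docker", "build", "-t", tag, context]


def docker_run_argv(tag, gpus="all"):
    """Interactive run with GPUs exposed to the container, as in the command above."""
    return ["sudo", "docker", "run", "-it", "--gpus", gpus, tag]


IMAGE = "kgtransformer:1018"
print(shlex.join(docker_build_argv(IMAGE)))
print(shlex.join(docker_run_argv(IMAGE)))
```

Keeping the arguments as a list means they can be handed to subprocess.run unchanged if the steps are ever scripted. The --gpus flag also relies on the NVIDIA container runtime being installed on the host.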