
Using PyTorch on ClearLinux docker image

2022-06-22 09:57

I have been using Nvidia’s official PyTorch docker image for my model training for quite a long time. It works very well, but the only problem is that the image is too large: more than 6 GB. On my poor home network, it takes a painfully long time to download.

Yesterday, an interesting idea popped into my mind: why not build my own small docker image for PyTorch? So I started to do it.

Firstly, I chose ClearLinux from Intel since it is very clean and uses all state-of-the-art software (which means its performance is fabulous).

I used distrobox to create my environment:

distrobox create --image clearlinux:latest \
    --name robin_clear \
    --home /home/robin/clearlinux \
    --additional-flags "--shm-size=4g" \
    --additional-flags "--gpus all" \
    --additional-flags "--device=/dev/nvidiactl" \
    --additional-flags "--device=/dev/nvidia0"

Enter the environment:

distrobox enter robin_clear
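
Before installing anything, it is worth checking that the GPU is actually visible inside the container. This assumes the NVIDIA driver and container toolkit are already set up on the host; if nvidia-smi is not injected into the container, the device nodes passed in above should at least be present:

nvidia-smi          # should list the GPU if the host's container toolkit injected it
ls /dev/nvidia*     # the device nodes passed via --additional-flags should show up here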

Download the CUDA 11.3 run file and install it in robin_clear.
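
The run file can be fetched directly from NVIDIA; the URL below follows NVIDIA's standard path for the 11.3.0 local installer, so it is worth double-checking against the CUDA archive page:

wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
chmod +x cuda_11.3.0_465.19.01_linux.run

Then run the installer: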

sudo ./cuda_11.3.0_465.19.01_linux.run \
        --toolkit \
        --no-man-page \
        --override \
        --silent

Then, the important part: install gcc-10 (ClearLinux ships gcc-12, which is too new for CUDA 11.3) and create symbolic links for it:

sudo swupd bundle-add c-extras-gcc10
sudo ln -s /usr/bin/gcc-10 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-10 /usr/local/cuda/bin/g++
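
The trick works because nvcc uses whatever gcc it resolves by default, so once /usr/local/cuda/bin comes before /usr/bin on the PATH, the symlinked gcc-10 shadows the system gcc-12. A quick sanity check (the expected outputs in the comments are what I would assume on this setup):

export PATH=/usr/local/cuda/bin:$PATH
which gcc        # expect /usr/local/cuda/bin/gcc
gcc --version    # expect gcc 10.x
nvcc --version   # expect release 11.3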

Install PyTorch:

sudo swupd bundle-add python3-basic

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
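
A one-liner to confirm that this wheel actually sees the GPU; it should print the torch version, the CUDA version it was built against, and True:

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"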

Install apex for mixed-precision training (because my model training uses it):

git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
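
To confirm that the CUDA extensions were actually compiled rather than silently skipped, importing one of them works as a smoke test. amp_C is one of the extension modules apex builds when --cuda_ext is passed; the exact module names may vary between apex revisions:

python3 -c "import amp_C; from apex import amp; print('apex CUDA extensions OK')"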

Now I can run my training in the ClearLinux container. Here is the comparison of the two docker images:

| | CUDA Version | PyTorch Version | Docker image size | VRAM Usage | Time for training one batch |
| --- | --- | --- | --- | --- | --- |
| Nvidia Official PyTorch Image | 11.7 | 1.12.0 | 223 MB | 10620 MB | 0.2725 seconds |
| My ClearLinux PyTorch Image | 11.3 | 1.11.0 | 6.38 GB | 10936 MB | 0.3066 seconds |

Looks like Nvidia works better than me 