This is a tutorial (learning record) on setting up PyTorch with CUDA on an Ubuntu 18.04 server (a remote PC with an RTX 3080, 2 CPU cores, and 4 GB of RAM).
- Platform Info
- Source switching for ubuntu18 (optional)
- Mounting extra disk space (optional)
- SSH remote login through FRP (Fast Reverse Proxy) or ZeroTier (recommended, optional)
- Install NVIDIA driver
- Install CUDA
- Add cuDNN plugins
- Install anaconda3
- Replace conda's download source (optional)
- Replace pip installation source (optional)
- Replace some download sources in Python packages & Implement convenient tools (optional)
- Some modifications and debugging when deploying to the virtual machine
This tutorial is based on a cloud computer (Ubuntu 18 server image) with a 2-core 4 GHz CPU, 4 GB of RAM, and an RTX 3080 GPU (10 GB). Thanks to CENI at USTC (University of Science and Technology of China) for providing the resources.
Back up sources.list first (optional):
sudo cp /etc/apt/sources.list some_where_you_want
Replace the content of sources.list with the lines below, using sudo nano /etc/apt/sources.list (Ctrl+O to save, Ctrl+X to exit):
deb https://mirrors.ustc.edu.cn/ubuntu/ bionic main restricted universe multiverse
deb-src https://mirrors.ustc.edu.cn/ubuntu/ bionic main restricted universe multiverse
deb https://mirrors.ustc.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
deb-src https://mirrors.ustc.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
deb https://mirrors.ustc.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
deb-src https://mirrors.ustc.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
deb https://mirrors.ustc.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
deb-src https://mirrors.ustc.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
deb https://mirrors.ustc.edu.cn/ubuntu/ bionic-proposed main restricted universe multiverse
deb-src https://mirrors.ustc.edu.cn/ubuntu/ bionic-proposed main restricted universe multiverse
- Note: the release codename used in sources.list differs between Ubuntu versions:
Ubuntu 22.04: jammy
Ubuntu 20.04: focal
Ubuntu 18.04: bionic
Ubuntu 16.04: xenial
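The mirror entries above differ between releases only in the codename, so the list can be generated instead of edited by hand. A minimal sketch, assuming the USTC mirror URL used in this tutorial; ubuntu_codename and print_sources are hypothetical helper names, not part of apt:

```shell
# Map an Ubuntu release number to its codename (only the releases listed above).
ubuntu_codename() {
    case "$1" in
        22.04) echo jammy ;;
        20.04) echo focal ;;
        18.04) echo bionic ;;
        16.04) echo xenial ;;
        *) echo "unknown release: $1" >&2; return 1 ;;
    esac
}

# Print the deb/deb-src entries for one codename, in the same order as above.
print_sources() {
    mirror="https://mirrors.ustc.edu.cn/ubuntu/"
    for suite in "" -updates -backports -security -proposed; do
        for kind in deb deb-src; do
            echo "$kind $mirror $1$suite main restricted universe multiverse"
        done
    done
}

# Example: preview the entries for Ubuntu 18.04 without touching /etc/apt.
print_sources "$(ubuntu_codename 18.04)"
```

Redirect the output to a scratch file and compare it with your backup before replacing /etc/apt/sources.list.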
Update the package index and upgrade the installed packages:
sudo apt-get update
sudo apt-get upgrade
Here's my brief note on how a Unix-based OS names, organizes, and initializes its filesystem:
Option 1: Through FRP (Fast Reverse Proxy) (recommended, optional):
Install the SSH server:
sudo apt install openssh-server
Then check that password login over SSH is permitted. Open the configuration with sudo nano /etc/ssh/sshd_config and make sure lines like Permit... are set to yes.
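To see which Permit... lines are currently set without opening an editor, the file can be filtered with grep. A small sketch; show_permit_lines is a hypothetical helper name, and the default path is the one used above:

```shell
# Print every line of an sshd config that starts with "Permit",
# including commented-out defaults (lines starting with "#Permit").
show_permit_lines() {
    config="${1:-/etc/ssh/sshd_config}"
    if [ ! -r "$config" ]; then
        echo "cannot read $config" >&2
        return 1
    fi
    # -n prefixes each match with its line number in the file.
    grep -n -E '^[[:space:]]*#?[[:space:]]*Permit' "$config"
}

# Example: inspect the system config if it exists on this machine.
show_permit_lines || true
```

After changing a value, restart the service with sudo systemctl restart ssh so that the new setting takes effect.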
Option 2: Through ZeroTier:
No third-party server with a public IP is required. The ZeroTier controller first establishes a reachable path between the nodes; after that, the nodes communicate with each other directly.
ZeroTier Document: https://docs.zerotier.com/guides/
Preparation: install the build prerequisites and add the graphics-drivers PPA first:
sudo apt install build-essential dkms
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
Go to https://www.nvidia.com/Download/index.aspx and find the driver that is compatible with your NVIDIA GPU (in my case, a GeForce RTX 3080). Then copy the link to the .run installer (you may have to accept some agreements on the webpage) and download it with wget. For the 3080, it's:
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
Install the package:
sudo sh NVIDIA-Linux-x86_64-550.90.07.run
Follow the recommended options (the installer may ask to disable the existing GPU kernel driver, typically nouveau). To check the installation, run:
nvidia-smi
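Besides reading the nvidia-smi table by eye, the GPU name and driver version can be queried in a script-friendly form. A sketch, assuming nvidia-smi is on the PATH after the installation; report_gpu is a hypothetical helper name:

```shell
# Report GPU name, driver version and total memory as one CSV line per GPU.
# Returns non-zero if nvidia-smi is missing or cannot reach the driver.
report_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver is probably not installed" >&2
        return 1
    fi
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}

report_gpu || echo "driver check failed"
```

If the driver version printed here differs from the one you downloaded, an older driver is probably still loaded and a reboot may be needed.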
Find the latest (recommended) CUDA version that is compatible with your OS; for Ubuntu 18.04, the latest supported version I found is v11.8. Copy the download link of the runfile (local) installer and download it (v11.8):
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
Install:
sudo sh cuda_11.8.0_520.61.05_linux.run
In the installer menu, select only the CUDA Toolkit options and do not install the driver again. Then add the CUDA paths to the environment:
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
To make this persistent for new login terminals, open ~/.bashrc with nano ~/.bashrc (sudo is not needed for your own file) and add the two lines above at the end.
Verify the CUDA installation:
nvcc -V
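Adding the two export lines to ~/.bashrc by hand is easy to repeat by accident, which leaves duplicate entries. A sketch of an idempotent append, assuming the default /usr/local/cuda prefix used above; append_once is a hypothetical helper name:

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x: match the whole line, -F: treat the pattern as a fixed string.
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo against a scratch file. Single quotes keep $PATH unexpanded,
# so it is evaluated later, when the shell reads the file.
demo=$(mktemp)
append_once 'export PATH=$PATH:/usr/local/cuda/bin' "$demo"
append_once 'export PATH=$PATH:/usr/local/cuda/bin' "$demo"   # no-op the second time
append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64' "$demo"
cat "$demo"
rm -f "$demo"
```

Pass "$HOME/.bashrc" as the second argument to apply it for real, then open a new terminal or run source ~/.bashrc.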
Find the cuDNN release that matches your CUDA version at https://developer.download.nvidia.cn/compute/cudnn/redist/cudnn/linux-x86_64/, then copy the link and download the compressed package (v9.2.0 for CUDA 11 here):
wget https://developer.download.nvidia.cn/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.2.0.82_cuda11-archive.tar.xz
Extract (-x) the specified tar file (-f):
tar -xf cudnn-linux-x86_64-9.2.0.82_cuda11-archive.tar.xz
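Install cuDNN by copying its headers and libraries into the CUDA directories, then make them readable by all users (the archive extracts into a directory named after the tarball):
sudo cp cudnn-linux-x86_64-9.2.0.82_cuda11-archive/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cudnn-linux-x86_64-9.2.0.82_cuda11-archive/lib/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*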
Verify the CUDA installation:
nvcc -V
- Sometimes an error such as "couldn't communicate with the NVIDIA driver" is raised (e.g. by nvidia-smi). To solve this, you may try installing the driver again.
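Note that nvcc -V reports the CUDA toolkit version but says nothing about cuDNN. One way to confirm the copied headers is to read the version macros from the header file. A sketch, assuming the header is named cudnn_version.h and lives under /usr/local/cuda/include; cudnn_version is a hypothetical helper name:

```shell
# Print MAJOR.MINOR.PATCHLEVEL from a cuDNN version header.
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn_version.h}"
    if [ ! -r "$header" ]; then
        echo "cannot read $header" >&2
        return 1
    fi
    # Pick the numeric value of each "#define CUDNN_<FIELD> <n>" line.
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}

cudnn_version || echo "cuDNN header not found"
```

For the archive used in this tutorial, the printed version should start with 9.2.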
Download the Anaconda3 installer listed at https://www.anaconda.com/download/success with:
wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
Install it (running as root is recommended):
sudo bash Anaconda3-2024.06-1-Linux-x86_64.sh
Add anaconda3 to the environment path (in my case, it is installed under the root user), and also add the two lines below to ~/.bashrc:
export ANACONDA=/root/anaconda3/
export PATH=$PATH:/root/anaconda3/bin
Then log out of the terminal and log in again. To verify the installation, just type conda.
PyTorch's official instructions for installing torch together with its dependencies: https://pytorch.org/get-started/previous-versions/
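Once torch is installed, a quick check confirms that it can see the GPU through the driver, CUDA, and cuDNN layers set up above. A sketch, assuming the python3 on the PATH is the interpreter that torch was installed into; check_torch_cuda is a hypothetical helper name:

```shell
# Ask PyTorch which CUDA/cuDNN build it uses and whether a GPU is visible.
check_torch_cuda() {
    if ! command -v python3 >/dev/null 2>&1; then
        echo "python3 not found" >&2
        return 1
    fi
    python3 - <<'EOF'
import sys
try:
    import torch
except ImportError:
    sys.exit("torch is not installed in this Python environment")
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
EOF
}

check_torch_cuda || echo "PyTorch check failed"
```

If "GPU available" is False while nvidia-smi works, the installed torch build may be CPU-only or built for a different CUDA version than the one on the machine.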
9. Replace conda's download source (optional, this part is from a CSDN blog):
Replace with USTC's conda source:
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes
Show the current conda source:
conda config --show-sources
Delete a specified channel:
conda config --remove channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
Restore the default channels:
conda config --remove-key channels
Temporary:
pip install -i https://pypi.mirrors.ustc.edu.cn/simple/ xxx(the_name_of_python_package)
Permanent:
pip config set global.index-url https://pypi.mirrors.ustc.edu.cn/simple/
- Verification:
pip config get global.index-url
- scp:
scp -P ### usr_name@xxx.xxx.xxx.xxx:/source_location/file_name usr_name@xxx.xxx.xxx.xxx:/destination
where ### is the port number exposed by the FRP server.
- git:
sudo apt install git-all
- gpustat:
pip install gpustat
(usage: gpustat -cp --watch -i 1)
- huggingface source switching (for mainland China):
make sure huggingface_hub is installed: pip install -U huggingface_hub
switch to the mirror site: export HF_ENDPOINT=https://hf-mirror.com (Linux) or $env:HF_ENDPOINT = "https://hf-mirror.com" (Windows PowerShell)
- ...
- The machine fails to connect to the network, e.g. ping 8.8.8.8 times out:
this may be caused by leftover or erroneous settings from a previous ethernet card configuration; for detailed solutions, see https://ubuntu.com/server/docs/configuring-networks
Solution: check /etc/netplan/some_network_configuration_file, update it, then apply it with netplan apply (usually, delete the old entry (the top one) and keep the new entry (the bottom one)).
- Huggingface model download:
huggingface-cli download --resume-download [model_idx] --local-dir [path] (original source)
export HF_ENDPOINT=https://hf-mirror.com (switch to the CN source on Linux)
$env:HF_ENDPOINT = "https://hf-mirror.com" (CN source on Windows), from CSDN