Using the GPU Server — A Complete Guide
Overview
The group GPU server, sim.csl.toronto.edu, currently has 1 TB of NVMe SSD for its boot drive (where all home directories are located), 12 TB of hard drive space for its data (located at /data), 256 GB of memory, 40 CPU cores, and 3 NVIDIA RTX A4500 GPUs with 20 GB of CUDA memory each.
For technical support, the point of contact is Baochun.
House Rules
Here is an overview of the house rules. For more details, refer to the later sections in this guide.
- Your account on sim.csl.toronto.edu is created for you and you only. Do not share your account with anyone. If your research collaborator needs a guest account, let me know and we’ll work something out. A shared account will be banned permanently.
- Before you run anything that takes more than 10 minutes, please use the srun or the sbatch command to run it as a job. For example:
srun --time=3:00:00 -c 12 --gres=gpu:1 --mem=36G ./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
will run a 3-hour job with 1 GPU, 12 CPU cores, and 36 GB of physical memory. If you do not specify a duration, the default is 3 hours. Refer to the later sections of this guide for more detailed documentation.
- GPU jobs that request resources using either srun or sbatch can use only the following three combinations:
  - 1 GPU + 12 CPU cores + 36 GB memory (1 unit per hour, no time limit)
  - 2 GPUs + 24 CPU cores + 72 GB memory (2 units per hour, up to 24 hours)
  - 3 GPUs + 36 CPU cores + 108 GB memory (3 units per hour, up to 12 hours)
  Do not request any other combinations of GPU + CPU cores + memory.
- To improve fairness while maximizing utilization, please follow the manual accounting policy below, which operates on an honour system. In a shared file called /share/accounting.txt, enter three pieces of information after you submit your job:
  <Your username>, the units you requested, your job ID
  For example, if I just submitted a job with job ID 3592 that uses 3 units per hour and lasts 5 hours (3 × 5 = 15 units), I would enter:
  bli, 15, 3592
  (One way to append this entry is shown in the example right after this list.)
- If your project has used up too many units on sim compared to the others and sim is currently under high demand, please make a serious attempt to use Compute Canada, for example on narval. Recently, jobs on narval have typically been scheduled within an hour; on weekends, it may be even faster. Please read the Complete Guide on the internal website before using Compute Canada.
- If you submit a CPU-only job, request no more than 4 CPU cores and 16 GB of memory. Your job should be able to run immediately.
- If you run any job related to machine learning, all datasets and model checkpoints must be placed in your own directory in /data. If you run any other I/O-intensive job, please make sure that the files involved are also stored in your own directory in /data.
- Please clean up your unused files under /data/<your user name> from time to time and make sure your usage stays below 1 TB. The command to find out your total usage in /data is:
du -hc /data/<your user name>
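For the accounting entry mentioned above, one simple way to append it is to echo the line into the shared file (this assumes /share/accounting.txt is writable by all users, as the house rules imply):
# Append <username>, <units requested>, <job ID> to the shared accounting file
echo "bli, 15, 3592" >> /share/accounting.txt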
Logging in
To log in initially, use the command:
ssh <your username>@sim.csl.toronto.edu
To change your password after you log in for the first time, use the passwd command:
passwd
To change your contact information so that others can view it using the finger command, use the chfn command:
bli@sim:~$ chfn
Password:
Changing the user information for bli
Enter the new value, or press ENTER for the default
Full Name: Baochun Li
Room Number []: BA 4118
Work Phone []: 416-946-7338
Home Phone []: 416-946-7338
bli@sim:~$
Adding your phone number is optional, but if you do, other members of the group can simply run finger bli (where bli is your username) to find it:
bli@sim:~$ finger bli
Login: bli Name: Baochun Li
Directory: /home/bli Shell: /bin/bash
Office: BA 4118, 416-946-7338 Home Phone: 416-946-7338
On since Sat Apr 9 13:42 (UTC) on pts/0 from 128.100.100.128
1 second idle
No mail.
No Plan.
bli@sim:~$
If you wish to change your login shell from the default /bin/bash (see finger information above) to /bin/zsh, you can use the chsh command:
bli@sim:~$ chsh
Password:
Changing the login shell for bli
Enter the new value, or press ENTER for the default
Login Shell [/bin/bash]: /bin/zsh
bli@sim:~$
You can refer to .zshrc and .zprofile for the example .zshrc and .zprofile that Baochun is using.
If you wish to save some typing when you log into sim, add an entry into your .ssh/config on your personal computer running macOS:
~$ cat >>.ssh/config
Host sim
HostName sim.csl.toronto.edu
User bli
<Press Control+D to exit>
~$
Here, bli is your own username. Next time, you can simply use the command ssh sim to log into sim.
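The same entry also works with other SSH-based commands. For example (the file name and destination directory below are just placeholders):
# Copy a local file to your own directory in /data on sim
scp results.tar.gz sim:/data/bli/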
Last but not least, if you wish to log into sim without typing your password, transfer your existing .ssh/id_rsa.pub to sim:
scp ~/.ssh/id_rsa.pub bli@sim.csl.toronto.edu:~
Password: (Enter your password)
$ ssh sim
Password: (Enter your password)
$ mkdir .ssh
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod -R go-rwx ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
$ rm id_rsa.pub
$ exit
If you do not have .ssh/id_rsa.pub yet, refer to Using Keychain in macOS for the necessary command to create one.
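If you need to create a key pair, the standard OpenSSH command (not specific to sim) is:
# Generate an RSA key pair; accept the default location ~/.ssh/id_rsa when prompted
ssh-keygen -t rsa -b 4096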
To store your private key in your keychain (in memory) so that you do not need to type your passphrase either, you need to install and activate keychain on your macOS:
brew install keychain
Then activate keychain in your .zshrc by adding the following lines:
# Keychain access
eval `keychain --quiet --eval --agents ssh --inherit any id_rsa`
Refer to Using Keychain in macOS for more details on activating your keychain if you use bash rather than zsh.
Access to Wireless Access Point
In order to connect to the wireless access point in BA 4176, you will need to edit /share/known-hosts.conf on sim and provide your MAC address. For example:
host Baochun-M1-Macbook-Pro { hardware ethernet f6:4e:92:6f:68:34; }
host Baochun-15-Macbook-Pro { hardware ethernet 3c:25:eb:24:c8:15; }
Provide a descriptive name for your MAC address, so that we know which computer/iPad/phone the MAC address is associated with.
After you are done, let Baochun know and he will activate your MAC address. To get the password of the iQua network, ask any graduate student in the iQua lab or Baochun.
You can skip this step if you do not need to work in BA 4176.
Using tmux
For anything that interacts with a server, it is strongly recommended that you install and run tmux on your macOS. tmux allows your server session to be completely detached from your personal computer, so that even if you put your computer to sleep or disconnect your Internet connection, your server session will not be disrupted. To install tmux on your macOS, use brew:
brew install tmux
Refer to .tmux.conf for an example ~/.tmux.conf that Baochun is using.
tmux has also been installed on sim. You can transfer your .tmux.conf to the server to use it more effectively:
scp .tmux.conf bli@sim.csl.toronto.edu:~
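If you are new to tmux, a minimal workflow on sim might look like the following (the session name work is arbitrary):
tmux new -s work      # start a new session named "work"
# run your commands, then press Ctrl-b followed by d to detach
tmux ls               # list existing sessions
tmux attach -t work   # re-attach to the session later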
Disk Quotas
Each user account has a soft per-user quota of 60 GB of space for its home directory, and a hard per-user space limit of 80 GB. If your usage is beyond the soft per-user quota, you have a grace period of 10 days to remove files and bring your usage below your quota. If your usage is above the hard limit, you will no longer be able to write to your home directory.
There are no quotas for /data, but please keep your usage below 1 TB.
In order to check your usage of disk space for your home directory, use the quota command:
(base) ~ $ quota bli
Disk quotas for user bli (uid 1000):
Filesystem blocks quota limit grace files quota limit grace
/dev/mapper/ubuntu--vg-ubuntu--lv
15197960 62914560 83886080 66568 0 0
The number under blocks indicates the current usage in 1 KB blocks. For example, the usage here is about 15 GB.
You can also use the command du -sh * to check the space used by each file or folder under the current directory. For example:
(plato) dixi@sim:~$ du -sh *
2.2G data
8.4G miniforge3
241M plato
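To see at a glance which items take the most space, you can sort that output (a standard shell pipeline, nothing specific to sim):
# Largest files and folders are listed last
du -sh * | sort -h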
Setting Up Virtual Environments for Python
To run machine learning jobs with Python, you will need to set up virtual environments. Baochun uses miniforge, a minimalist variant of miniconda or anaconda. To set it up, use the commands:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash ./Miniforge3-Linux-x86_64.sh
rm ./Miniforge3-Linux-x86_64.sh
After setting up miniforge, the usual conda commands can be used, such as:
To list all currently installed virtual environments:
conda env list
To create and activate a new virtual environment called plato
with Python 3.9:
conda create -n plato python=3.9
conda activate plato
To install PyTorch with CUDA 11.6 support in the new virtual environment:
pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
The command above must be executed before you install any other packages in your virtual environment, such as using the pip install -r requirements.txt command or the pip install . command. This ensures that you have the correct version of PyTorch installed to take advantage of the NVIDIA RTX A4500 GPUs on sim.
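To verify that PyTorch can actually see a GPU, a quick sanity check is to run a short command through srun with the 1-GPU house-rule combination (the plato environment name follows the example above):
conda activate plato
# Should print True and the number of GPUs allocated to the job (1 here)
srun --time=0:10:00 -c 12 --gres=gpu:1 --mem=36G python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"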
Running Jobs Interactively
To run a new job interactively, use the srun command. For example, the following command can be used to start a 3-hour job with 12 CPU cores, 1 GPU, and 36 GB of memory:
srun --time=3:00:00 -c 12 --gres=gpu:1 --mem=36G ./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
If you do not specify the memory requirement, for each CPU core you allocate, the job will get 4096 MB of memory. If you do not specify the time duration, the default will be three hours.
Tip: If you wish to get the job started almost immediately, do not request GPU resources.
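For instance, a short CPU-only job that follows the house rules (at most 4 cores and 16 GB of memory; the script name preprocess.py is just a placeholder) could be started as:
srun --time=1:00:00 -c 4 --mem=16G python preprocess.py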
Submitting Batch Jobs
To submit a batch job, use the sbatch command with a configuration file containing all the settings for the scheduler to run your job. The following example is a typical configuration:
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:2
#SBATCH --mem=72G
#SBATCH --output=<output_filename>.out
./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
Important notes:
- --time: This is required, and specifies the maximum duration of time for the job to run. When the amount of time specified here expires, the job will be terminated. Keep this value as short as possible, since the shorter it is, the more likely your job can be scheduled early. Note that according to our house rules, the maximum duration depends on the number of GPUs you request.
- --cpus-per-task=24: This represents the number of CPU cores you wish to request. If your job just needs one CPU, you do not need to add this option.
- --gres=gpu:2: This represents the number of GPUs you wish to request. The maximum number is 3. Once your job has been successfully launched, it will only be able to access the GPU(s) allocated to it.
- --mem=72G: The amount of main memory you wish to request. Keep this as low as you can. If you do not specify this option, 4096 MB will be requested for each CPU core you request using the --cpus-per-task option. Keep in mind that if you do not request enough memory, your job will be terminated with an out-of-memory error.
- --output=<filename>.out: This sets the filename where the job’s output will be stored. Use the logging facility in Python, rather than print(), to log to this file.
Once the configuration file is available, use the sbatch command to submit it:
sbatch <config_filename>.sh
After the job is submitted, use the squeue command to check its status:
squeue
In its output below, ST stands for STATUS, R means RUNNING, and PD means PENDING:
(base) bli@sim:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51 sim bash bli R 0:03 1 sim
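If the queue is long, you can restrict the output to your own jobs with the standard -u option (replace bli with your username):
squeue -u bli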
To cancel a submitted job:
scancel <jobid>
Watching the job output live
If you wish to monitor the job output live as it is generated, you can use the following command:
watch -n 1 tail -n 70 ./<log_output_filename>.out
Here, the -n parameter for watch specifies the monitoring frequency in seconds (the default value is 2 seconds), and the -n parameter for tail specifies the number of lines at the end of the file to be shown. Type Control + C to exit the watch session.
In your job, you may wish to write checkpoints or other results into files, rather than depending entirely on logging output. However, if you do decide to write to files, make sure they are located in the /data directory.
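For example, you could keep checkpoints in a subdirectory of your own directory in /data (the checkpoints name below is only a suggestion, assuming your directory is named after your username as elsewhere in this guide):
# Create a personal directory for checkpoints under /data
mkdir -p /data/$USER/checkpoints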
Checking the GPU Status
Use the following command to check the GPU status:
nvidia-smi
If you want to monitor the GPU status, use:
watch -n 1 nvidia-smi
to display the status and update it every 1 second.
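If you prefer a more compact view, nvidia-smi can also print selected fields only (these are standard nvidia-smi query options):
# One line per GPU: index, utilization, and memory used / total
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv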
Running Plato
When training the ResNet-18 or VGG-16 model, 20 GB of CUDA memory can stably handle 7 processes at the same time. Therefore, in the configuration file of your FL training session, you can set max_concurrency under trainer to 7 to run your session at the fastest speed.
For an FL training session that selects 100 clients in each communication round, a recommended configuration of a batch job would be:
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=36
#SBATCH --gres=gpu:3
#SBATCH --mem=216G
#SBATCH --output=<output_filename>.out
./run -c configs/CIFAR10/fedavg_resnet18.yml -b /data/bli/plato
Such a configuration lets you take the fullest advantage of the resources on sim, using all three GPUs and running 21 clients at the same time. It will take only about 5 minutes to finish one round.
Please note that memory usage gradually increases over time, so the amount of main memory you request should depend on the number of communication rounds. In the above configuration, 216 GB of main memory is enough for running 36 rounds.
Administrative notes
To create a new user, use the commands:
sudo adduser <username>
sudo usermod -a -G iqua <username>
To set the quota for the new user, use the following command (see the documentation on file system quotas on Ubuntu 20.04 for more details):
sudo setquota -u <username> 60G 80G 0 0 /
To show how each user uses their disk quota:
sudo repquota -a
To add additional MAC addresses to DHCP, restart the DHCP service, and then check its status to confirm that the service is running:
sudo cp /share/known-hosts.conf /etc/dhcp
sudo systemctl restart isc-dhcp-server.service
sudo systemctl status isc-dhcp-server.service
To update the group external website:
cd /share/external-website; git pull; chmod -R go+rX .
To make sure that there are no syntax errors in any of the nginx configuration files, and then restart the web server:
sudo nginx -t
sudo systemctl restart nginx
To change the state of sim in Slurm from draining or drained back to normal when no job is currently using it:
sudo scontrol update nodename=sim state=idle
If jobs are currently running on the node:
sudo scontrol update nodename=sim state=resume
To check the detailed status of a running or pending job, use the command:
scontrol show job <Job ID>