Using the GPU Server — A Complete Guide
Overview
The group GPU server, sim.csl.toronto.edu, currently has 1 TB of NVMe SSD for its boot drive (where all home directories are located), 12 TB of hard drive space for its data (located at /data), 256 GB of memory, 40 CPU cores, and 3 NVIDIA RTX A4500 GPUs with 20 GB of CUDA memory each.
For technical support, the point of contact is Baochun.
House Rules
Here is an overview of the house rules. For more details, refer to the later sections in this guide.
- Your account on sim.csl.toronto.edu is created for you and you only. Do not share your account with anyone. If your research collaborator needs a guest account, let me know and we’ll work something out. A shared account will be banned permanently.
- Before you run anything that takes more than 10 minutes, please use the srun or the sbatch command to run it as a job. For example:
srun --time=3:00:00 -c 12 --gres=gpu:1 --mem=36G ./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
will run a 3-hour job with 1 GPU, 12 CPU cores, and 36 GB of physical memory. If you do not specify a duration, the default is 3 hours. Refer to the later sections of this guide for more detailed documentation.
- GPU jobs that request resources using either srun or sbatch can use only the following three combinations:
  - 1 GPU + 12 CPU cores + 36 GB memory (1 unit per hour, no time limit)
  - 2 GPUs + 24 CPU cores + 72 GB memory (2 units per hour, up to 24 hours)
  - 3 GPUs + 36 CPU cores + 108 GB memory (3 units per hour, up to 12 hours)
  Do not request any other combinations of GPU + CPU cores + memory.
- To improve fairness while maximizing utilization, please follow the manual accounting policy below, which operates on an honour system. In a shared file called /share/accounting.txt, enter three pieces of information after you submit your job:
  <Your username>, the units you requested, your job ID
  For example, if I just submitted a job with job ID 3592 that uses 3 units per hour and lasts 5 hours (3 × 5 = 15 units), I would enter:
  bli, 15, 3592
  (One way to append this entry is shown in the example right after this list.)
- If your project has used up too many units on sim compared to the others and sim is currently under high demand, please make a serious attempt to use Compute Canada, for example on narval. Recently, jobs on narval have typically been scheduled within an hour; on weekends, it may be even faster. Please read the Complete Guide on the internal website before using Compute Canada.
- If you submit a CPU-only job, request no more than 4 CPU cores and 16 GB of memory. Your job should be able to run immediately.
- If you run any job related to machine learning, all datasets and model checkpoints must be placed in your own directory in /data. If you run any other I/O-intensive job, please make sure that the files involved are also stored in your own directory in /data.
- Please clean up your unused files under /data/<your user name> from time to time and make sure your usage stays below 1 TB. The command to find out your total usage in /data is:
du -hc /data/<your user name>
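For the accounting entry mentioned above, one simple way to append it is to echo the line into the shared file (this assumes /share/accounting.txt is writable by all users, as the house rules imply):
# Append <username>, <units requested>, <job ID> to the shared accounting file
echo "bli, 15, 3592" >> /share/accounting.txt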
Logging in
To log in initially, use the command:
ssh <your username>@sim.csl.toronto.edu
To change your password after you log in for the first time, use the passwd command:
passwd
To change your contact information so that others can view it using the finger command, use the chfn command:
bli@sim:~$ chfn
Password:
Changing the user information for bli
Enter the new value, or press ENTER for the default
Full Name: Baochun Li
Room Number []: BA 4118
Work Phone []: 416-946-7338
Home Phone []: 416-946-7338
bli@sim:~$
Adding your phone number is optional, but if you do, other members of the group can simply run finger bli (where bli is your username) to find it:
bli@sim:~$ finger bli
Login: bli Name: Baochun Li
Directory: /home/bli Shell: /bin/bash
Office: BA 4118, 416-946-7338 Home Phone: 416-946-7338
On since Sat Apr 9 13:42 (UTC) on pts/0 from 128.100.100.128
1 second idle
No mail.
No Plan.
bli@sim:~$
If you wish to change your login shell from the default /bin/bash (see finger information above) to /bin/zsh, you can use the chsh command:
bli@sim:~$ chsh
Password:
Changing the login shell for bli
Enter the new value, or press ENTER for the default
Login Shell [/bin/bash]: /bin/zsh
bli@sim:~$
You can refer to .zshrc and .zprofile for the example .zshrc and .zprofile that Baochun is using.
If you wish to save some typing when you log into sim, add an entry into your .ssh/config on your personal computer running macOS:
~$ cat >>.ssh/config
Host sim
HostName sim.csl.toronto.edu
User bli
<Press Control+D to exit>
~$
Here, bli is your own username. Next time, you can simply use the command ssh sim to log into sim.
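The same entry also works with other SSH-based commands. For example (the file name and destination directory below are just placeholders):
# Copy a local file to your own directory in /data on sim
scp results.tar.gz sim:/data/bli/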
Last but not least, if you wish to log into sim without typing your password, transfer your existing .ssh/id_rsa.pub to sim:
scp ~/.ssh/id_rsa.pub bli@sim.csl.toronto.edu:~
Password: (Enter your password)
$ ssh sim
Password: (Enter your password)
$ mkdir .ssh
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod -R go-rwx ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
$ rm id_rsa.pub
$ exit
If you do not have .ssh/id_rsa.pub yet, refer to Using Keychain in macOS for the necessary command to create one.
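If you need to create a key pair, the standard OpenSSH command (not specific to sim) is:
# Generate an RSA key pair; accept the default location ~/.ssh/id_rsa when prompted
ssh-keygen -t rsa -b 4096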
To store your private key in your keychain (in memory) so that you do not need to type your passphrase either, you need to install and activate keychain on your macOS:
brew install keychain
Then activate keychain in your .zshrc by adding the following lines:
# Keychain access
eval `keychain --quiet --eval --agents ssh --inherit any id_rsa`
Refer to Using Keychain in macOS for more details on activating your keychain if you use bash rather than zsh.
Access to Wireless Access Point
In order to connect to the wireless access point in BA 4176, you will need to edit /share/known-hosts.conf on sim and provide your MAC address. For example:
host Baochun-M1-Macbook-Pro { hardware ethernet f6:4e:92:6f:68:34; }
host Baochun-15-Macbook-Pro { hardware ethernet 3c:25:eb:24:c8:15; }
Provide a descriptive name for your MAC address, so that we know which computer/iPad/phone the MAC address is associated with.
After you are done, let Baochun know and he will activate your MAC address. To get the password of the iQua network, ask any graduate student in the iQua lab or Baochun.
You can skip this step if you do not need to work in BA 4176.
Using tmux
For anything that interacts with a server, it is strongly recommended that you install and run tmux on your macOS. tmux allows your server session to be completely detached from your personal computer, so that even if you put your computer to sleep or disconnect your Internet connection, your server session will not be disrupted. To install tmux on your macOS, use brew:
brew install tmux
Refer to .tmux.conf for an example ~/.tmux.conf that Baochun is using.
tmux has also been installed on sim. You can transfer your .tmux.conf to the server to use it more effectively:
scp .tmux.conf bli@sim.csl.toronto.edu:~
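If you are new to tmux, a minimal workflow on sim might look like the following (the session name work is arbitrary):
tmux new -s work      # start a new session named "work"
# run your commands, then press Ctrl-b followed by d to detach
tmux ls               # list existing sessions
tmux attach -t work   # re-attach to the session later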
Disk Quotas
Each user account has a soft per-user quota of 60 GB of space for its home directory, and a hard per-user space limit of 80 GB. If your usage is beyond the soft per-user quota, you have a grace period of 10 days to remove files and bring your usage below your quota. If your usage is above the hard limit, you will no longer be able to write to your home directory.
There are no quotas for /data, but please keep your usage below 1 TB.
In order to check your usage of disk space for your home directory, use the quota command:
(base) ~ $ quota bli
Disk quotas for user bli (uid 1000):
Filesystem blocks quota limit grace files quota limit grace
/dev/mapper/ubuntu--vg-ubuntu--lv
15197960 62914560 83886080 66568 0 0
The number under blocks indicates the current usage in 1 KB blocks. For example, the usage here is about 15 GB.
You can also use the command du -sh * to check the space used by each file or folder under the current directory. For example:
(plato) dixi@sim:~$ du -sh *
2.2G data
8.4G miniforge3
241M plato
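To see at a glance which items take the most space, you can sort that output (a standard shell pipeline, nothing specific to sim):
# Largest files and folders are listed last
du -sh * | sort -h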
Setting Up Virtual Environments for Python
To run machine learning jobs with Python, you will need to set up virtual environments. Baochun uses miniforge, a minimalist variant of miniconda or anaconda. To set it up, use the commands:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash ./Miniforge3-Linux-x86_64.sh
rm ./Miniforge3-Linux-x86_64.sh
After setting up miniforge, the usual conda commands can be used, such as:
To list all currently installed virtual environments:
conda env list
To create and activate a new virtual environment called plato
with Python 3.9:
conda create -n plato python=3.9
conda activate plato
To install PyTorch with CUDA 11.6 support in the new virtual environment:
pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116
The command above must be executed before you install any other packages in your virtual environment, such as using the pip install -r requirements.txt command or the pip install . command. This ensures that you have the correct version of PyTorch installed to take advantage of the NVIDIA RTX A4500 GPUs on sim.
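To verify that PyTorch can actually see a GPU, a quick sanity check is to run a short command through srun with the 1-GPU house-rule combination (the plato environment name follows the example above):
conda activate plato
# Should print True and the number of GPUs allocated to the job (1 here)
srun --time=0:10:00 -c 12 --gres=gpu:1 --mem=36G python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"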
Running Jobs Interactively
To run a new job interactively, use the srun command. For example, the following command can be used to start a 3-hour job with 12 CPU cores, 1 GPU, and 36 GB of memory:
srun --time=3:00:00 -c 12 --gres=gpu:1 --mem=36G ./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
If you do not specify the memory requirement, for each CPU core you allocate, the job will get 4096 MB of memory. If you do not specify the time duration, the default will be three hours.
Tip: If you wish to get the job started almost immediately, do not request GPU resources.
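For instance, a short CPU-only job that follows the house rules (at most 4 cores and 16 GB of memory; the script name preprocess.py is just a placeholder) could be started as:
srun --time=1:00:00 -c 4 --mem=16G python preprocess.py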
Submitting Batch Jobs
To submit a batch job, use the sbatch command with a configuration file containing all the settings for the scheduler to run your job. The following example is a typical configuration:
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --cpus-per-task=24
#SBATCH --gres=gpu:2
#SBATCH --mem=72G
#SBATCH --output=<output_filename>.out
./run -c configs/MNIST/fedavg_lenet5.yml -b /data/bli/plato
Important notes:
- --time: This is required, and specifies the maximum duration of time for the job to run. When the amount of time specified here expires, the job will be terminated. Keep this value as short as possible, since the shorter it is, the more likely your job can be scheduled early. Note that according to our house rules, the maximum duration depends on the number of GPUs you request.
- --cpus-per-task=24: This represents the number of CPU cores you wish to request. If your job just needs one CPU, you do not need to add this option.
- --gres=gpu:2: This represents the number of GPUs you wish to request. The maximum number is 3. Once your job has been successfully launched, it will only be able to access the GPU(s) allocated to it.
- --mem=72G: The amount of main memory you wish to request. Keep this as low as you can. If you do not specify this option, 4096 MB will be requested for each CPU core you request using the --cpus-per-task option. Keep in mind that if you do not request enough memory, your job will be terminated with an out-of-memory error.
- --output=<filename>.out: This sets the filename where the job’s output will be stored. Use the logging facility in Python, rather than print(), to log to this file.
Once the configuration file is available, use the sbatch command to submit it:
sbatch <config_filename>.sh
After the job is submitted, use the squeue command to check its status:
squeue
In its output below, ST stands for STATUS, R means RUNNING, and PD means PENDING:
(base) bli@sim:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51 sim bash bli R 0:03 1 sim
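If the queue is long, you can restrict the output to your own jobs with the standard -u option (replace bli with your username):
squeue -u bli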
To cancel a submitted job:
scancel <jobid>
Watching the job output live
If you wish to monitor the job output live as it is generated, you can use the following command:
watch -n 1 tail -n 70 ./<log_output_filename>.out
Here, the -n parameter for watch specifies the monitoring frequency in seconds (the default value is 2 seconds), and the -n parameter for tail specifies the number of lines at the end of the file to be shown. Type Control + C to exit the watch session.
In your job, you may wish to write checkpoints or other results into files, rather than depending entirely on logging output. However, if you do decide to write to files, make sure they are located in the /data directory.
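For example, you could keep checkpoints in a subdirectory of your own directory in /data (the checkpoints name below is only a suggestion, assuming your directory is named after your username as elsewhere in this guide):
# Create a personal directory for checkpoints under /data
mkdir -p /data/$USER/checkpoints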
Checking the GPU Status
Use the following command to check the GPU status:
nvidia-smi
If you want to monitor the GPU status, use:
watch -n 1 nvidia-smi
to display the status and update it every 1 second.
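If you prefer a more compact view, nvidia-smi can also print selected fields only (these are standard nvidia-smi query options):
# One line per GPU: index, utilization, and memory used / total
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv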
Running Plato
When training the ResNet-18 or VGG-16 model, 20 GB of CUDA memory can stably handle 7 processes at the same time. Therefore, in the configuration file of your FL training session, you can set max_concurrency under trainer to 7 to run your session at the fastest speed.
For an FL training session that selects 100 clients in each communication round, a recommended configuration of a batch job would be:
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=36
#SBATCH --gres=gpu:3
#SBATCH --mem=216G
#SBATCH --output=<output_filename>.out
./run -c configs/CIFAR10/fedavg_resnet18.yml -b /data/bli/plato
Such a configuration lets you take the fullest advantage of the resources on sim, using all three GPUs and running 21 clients at the same time. It will take only about 5 minutes to finish one round.
Please note that memory usage gradually increases over time, so the amount of main memory you request should depend on the number of communication rounds. In the above configuration, 216 GB of main memory is enough for running 36 rounds.
Administrative notes
To create a new user, use the commands:
sudo adduser <username>
sudo usermod -a -G iqua <username>
To set the quota for the new user, use the following command (see the documentation on file system quotas on Ubuntu 20.04 for more details):
sudo setquota -u <username> 60G 80G 0 0 /
To show how each user uses their disk quota:
sudo repquota -a
To add additional MAC addresses to DHCP, restart the DHCP service, and then check its status to confirm that the service is running:
sudo cp /share/known-hosts.conf /etc/dhcp
sudo systemctl restart isc-dhcp-server.service
sudo systemctl status isc-dhcp-server.service
To update the group external website:
cd /share/external-website; git pull; chmod -R go+rX .
To make sure that there are no syntax errors in any of the nginx configuration files, and then restart the web server:
sudo nginx -t
sudo systemctl restart nginx
To change the state of sim in Slurm from draining or drained back to normal when no job is currently using it:
sudo scontrol update nodename=sim state=idle
If jobs are currently running on the node:
sudo scontrol update nodename=sim state=resume
To check the detailed status of a running or pending job, use the command:
scontrol show job <Job ID>