PyTorch

Loading PyTorch

module load ml/pytorch

PyTorch versions

Use module spider pytorch to list the PyTorch versions available as modules.

module load ml/pytorch also loads the modules PyTorch depends on, such as cuda.
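
Once the module is loaded, you can confirm which PyTorch build you picked up and that it sees the CUDA runtime. A quick check from a Python prompt (the printed values depend on the module version, and the GPU check only returns True on a node with a GPU):

import torch

print(torch.__version__)          # PyTorch version provided by the module
print(torch.version.cuda)         # CUDA version this build was compiled against
print(torch.cuda.is_available())  # True only on a node with a visible GPU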

If you use the Jupyter server on lrc-openondemand, PyTorch kernels for torch 2.0.1 and torch 2.3.1 are available.

Multi-GPU jobs

Sample multi-GPU PyTorch code can be found in the Distributed PyTorch tutorial examples on GitHub. The SLURM script provided in the PyTorch examples folder can be adapted to run on our cluster. The SLURM script below runs the tutorial's multinode.py script on four A40 GPUs distributed across two nodes:

#!/bin/bash
#SBATCH --job-name=ddp_on_A40
#SBATCH --partition=es1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --account=<ACCOUNT_NAME>
#SBATCH --time=01:00:00
#SBATCH --qos=es_normal
#SBATCH --gres=gpu:A40:2

module load ml/pytorch

# Expand $SLURM_JOB_NODELIST into one hostname per line
allocated_nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
# Flatten to a space-separated list and use the first node as the rendezvous host
nodes=${allocated_nodes//$'\n'/ }
nodes_array=($nodes)
head_node=${nodes_array[0]}

echo "Head Node: $head_node"
echo "Node List: $nodes"

# srun starts one torchrun launcher per node (--ntasks-per-node=1);
# each launcher spawns one worker per GPU (--nproc_per_node 2)
srun torchrun --nnodes 2 \
              --nproc_per_node 2 \
              --rdzv_id $RANDOM \
              --rdzv_backend c10d \
              --rdzv_endpoint $head_node:29500 \
              $SLURM_SUBMIT_DIR/multinode.py 500 10
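
Submit the script with sbatch. The two torchrun launchers rendezvous at port 29500 on the head node and each spawns two workers, one per GPU, for four in total. In the tutorial's multinode.py, the two positional arguments (500 10 above) are the total number of epochs and the checkpoint interval.

For reference, multinode.py follows the standard torchrun DDP pattern: torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker, and the script builds a NCCL process group from them. The sketch below shows that pattern only; the model, data, and training loop are stand-ins, not the tutorial's code:

import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker it spawns
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Stand-in model and data; replace with your own
    model = DDP(torch.nn.Linear(20, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    inputs = torch.randn(32, 20, device=device)
    targets = torch.randn(32, 1, device=device)

    for _ in range(10):  # stand-in training loop
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()  # DDP averages gradients across all workers here
        optimizer.step()

    destroy_process_group()

if __name__ == "__main__":
    main()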