Es1 (Einsteinium) GPU Cluster

Es1 (Einsteinium) is an institutional GPU cluster deployed to meet the growing computational demand of researchers doing machine learning and deep learning. The system is named after the chemical element with symbol Es and atomic number 99, which was discovered at Lawrence Berkeley National Laboratory in 1952 and named in honor of Albert Einstein, who developed the theory of relativity.

Es1 is a partition consisting of multiple GPU node types that address different research needs. These include:

| Accelerator     | Nodes | GPUs per Node / GPU Memory | CPU Processor             | CPU Cores | CPU RAM         | InfiniBand |
|-----------------|-------|----------------------------|---------------------------|-----------|-----------------|------------|
| NVIDIA 2080TI   | 12    | 4x 11 GB                   | Intel Xeon Silver 4212    | 8         | 96 GB           | FDR        |
| NVIDIA V100     | 15    | 2x 32 GB                   | Intel Xeon E5-2623        | 8         | 64 GB / 192 GB  | FDR        |
| NVIDIA GRTX8000 | 1     | 4x 48 GB                   | AMD EPYC 7713             | 64        | 512 GB          | HDR        |
| NVIDIA A40      | 30    | 4x 48 GB                   | AMD EPYC 7742             | 64        | 512 GB          | FDR        |
| NVIDIA A100     | 1     | 4x 80 GB                   | AMD EPYC 7713             | 64        | 512 GB          | HDR        |
| NVIDIA H100     | 5     | 8x 80 GB                   | Intel Xeon Platinum 8480+ | 112       | 1 TB            | NDR        |
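
To see how these GPU types, core counts, and memory sizes are laid out on the es1 nodes at any given time, a Slurm query along the following lines can be used (a sketch; the exact format string is an assumption):

sinfo -p es1 -N -o "%N %G %c %m"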

H100 and CBORG

Currently, we have five NVIDIA H100 nodes in our data center, but only three are available to users through Slurm. One H100 node (8 GPUs) is used for LLM inference by CBORG.

How to specify desired GPU card(s)

Due to the hardware configuration, special attention is needed to keep the correct ratio of CPU cores to GPUs in a job request: request as many CPU cores per GPU as the node provides, i.e. the node's CPU core count divided by its GPU count (for example, an A40 node has 64 cores and 4 GPUs, so request 16 cores per A40 GPU).

Examples:

  • Request one V100 card: --cpus-per-task=4 --gres=gpu:V100:1 --ntasks=1
  • Request two A40 cards: --cpus-per-task=16 --gres=gpu:A40:2 --ntasks=2
  • Request three H100 cards: --cpus-per-task=14 --gres=gpu:H100:3 --ntasks=3
  • Request one A100 card: --cpus-per-task=16 --gres=gpu:A100:1 --ntasks=1
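
The same flags can be used for interactive work with srun. Here is a minimal sketch, assuming the es1 partition and es_normal QOS used in the batch script below, with account_name as a placeholder for your project account:

srun --partition=es1 --qos=es_normal --account=account_name \
     --ntasks=1 --cpus-per-task=4 --gres=gpu:V100:1 \
     --time=1:00:00 --pty bash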

Example Slurm script

Here is an example Slurm script that requests one NVIDIA A40 GPU card.

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=es1
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
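# 16 CPU cores per A40 GPU: A40 nodes have 64 cores and 4 GPUs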
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:A40:1
#SBATCH --time=1:00:00

module load ml/pytorch
python train.py
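
The script can be saved (for example as train_a40.sh, a placeholder name) and then submitted and monitored with the standard Slurm commands:

sbatch train_a40.sh
squeue -u $USER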