Es1 (Einsteinium) GPU Cluster¶
Es1, or Einsteinium, is an institutional GPU cluster deployed to meet the growing computational demand from researchers doing machine learning and deep learning. The system is named after the chemical element with symbol Es
and atomic number 99, which was discovered at Lawrence Berkeley National Laboratory in 1952 and named in honor of Albert Einstein, who developed the theory of relativity.
Es1 is a partition consisting of multiple GPU node types that address different research needs. These include:
| Accelerator | Nodes | GPUs per Node / GPU Memory | CPU Processor | CPU Cores | CPU RAM | Infiniband |
|---|---|---|---|---|---|---|
| NVIDIA 2080TI | 12 | 4x 11 GB | Intel Xeon Silver 4212 | 8 | 96 GB | FDR |
| NVIDIA V100 | 15 | 2x 32 GB | Intel Xeon E5-2623 | 8 | 64 GB / 192 GB | FDR |
| NVIDIA GRTX8000 | 1 | 4x 48 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA A40 | 30 | 4x 48 GB | AMD EPYC 7742 | 64 | 512 GB | FDR |
| NVIDIA A100 | 1 | 4x 80 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA H100 | 5 | 8x 80 GB | Intel Xeon Platinum 8480+ | 112 | 1 TB | NDR |
H100 and CBORG
Currently, we have five NVIDIA H100 nodes in our datacenter, but only three of them are available to users through Slurm. One H100 node (8 GPUs) is used for LLM inference by CBORG.
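To see which GPU types and counts are actually configured in Slurm, one option (a sketch assuming standard Slurm tooling; the exact GRES strings may differ) is to list the generic resources of each node in the es1 partition:

```bash
# Print each es1 node alongside its GRES entry (GPU type and count)
sinfo -p es1 -N -o "%N %G"
```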
How to specify desired GPU card(s)¶
Due to the hardware configuration, special attention is needed to maintain the correct ratio of CPU cores to GPUs in your request. As a rule of thumb, set --cpus-per-task so that each GPU gets the node's CPU core count divided by its GPU count (for example, 4 cores per V100 GPU and 16 cores per A40 GPU).
Examples:

- Request one V100 card: `--cpus-per-task=4 --gres=gpu:V100:1 --ntasks=1`
- Request two A40 cards: `--cpus-per-task=16 --gres=gpu:A40:2 --ntasks=2`
- Request three H100 cards: `--cpus-per-task=14 --gres=gpu:H100:3 --ntasks=3`
- Request one A100 card: `--cpus-per-task=16 --gres=gpu:A100:1 --ntasks=1`
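Putting these flags together, here is a minimal sketch of an interactive session on one A40 card (substitute your own project account for the `account_name` placeholder):

```bash
# Request an interactive shell on the es1 partition with one A40 GPU
# and the matching 16 CPU cores; account_name is a placeholder.
srun --partition=es1 --qos=es_normal --account=account_name \
     --ntasks=1 --cpus-per-task=16 --gres=gpu:A40:1 \
     --time=1:00:00 --pty /bin/bash
```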
Example Slurm script¶
Here is an example Slurm script that requests one NVIDIA A40 GPU card.
#!/bin/bash
# Job name and the project account to charge
#SBATCH --job-name=test
#SBATCH --account=account_name
# Run on the Es1 GPU partition with the normal QoS
#SBATCH --partition=es1
#SBATCH --qos=es_normal
# One task with 16 CPU cores and one A40 GPU (16 cores per A40 GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:A40:1
# One-hour wall-clock limit
#SBATCH --time=1:00:00

module load ml/pytorch
python train.py
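Assuming the script above is saved as `job.sh` (a placeholder name), it is submitted and monitored with the standard Slurm commands:

```bash
sbatch job.sh      # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER    # show the state of your pending and running jobs
```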