Einsteinium GPU Cluster¶
Einsteinium is an institutional GPU cluster deployed to meet the growing computational demand from researchers doing machine learning and deep learning. The system is named after the chemical element with symbol Es and atomic number 99, which was discovered at Lawrence Berkeley National Laboratory in 1952, and in honor of Albert Einstein, who developed the theory of relativity.
es2 Partition¶
es2 is a partition consisting of H100 and H200 GPU nodes. These nodes were added in 2024 and 2025 to meet AI/LLM-related research needs as well as the growing use of GPUs in scientific computing.
| Accelerator | Nodes | GPUs per Node / GPU Memory | CPU Processor | CPU Cores | CPU RAM | InfiniBand |
|---|---|---|---|---|---|---|
| NVIDIA H200 | 3 | 8x 141 GB | Intel Xeon Platinum 8570 | 112 | 2 TB | NDR |
| NVIDIA H100 | 4 | 8x 80 GB | Intel Xeon Platinum 8480+ | 112 | 1 TB | NDR |
How to specify desired GPU card(s)¶
The normal qos for the es2 partition is called es2_normal. The other qos values are es_debug and es_lowprio.
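For short test runs, the es_debug qos can be selected instead of es2_normal. The sketch below only illustrates the syntax and uses a placeholder job script name; check the site's qos limits before relying on es_debug or es_lowprio.
# Hedged example: submit a short test job to es2 under the es_debug qos.
# myjob.sh is a placeholder; time and resource limits for each qos are site-defined.
sbatch --partition=es2 --qos=es_debug myjob.sh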
Due to the hardware configuration, special attention is needed to keep the ratio of CPU cores to GPUs correct: each es2 node has 112 CPU cores and 8 GPUs, so request 14 CPU cores per GPU (for example, one task per GPU with --cpus-per-task=14).
Examples:
- Request three H100 cards:
--partition=es2 --gres=gpu:H100:3 --cpus-per-task=14 --ntasks=3
- Request one H200 card:
--partition=es2 --gres=gpu:H200:1 --cpus-per-task=14 --ntasks=1
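The same flags work with srun for an interactive session, assuming interactive jobs are permitted on es2; a minimal sketch:
# Hedged sketch: interactive shell on an es2 node with one H100 GPU.
# Assumes interactive jobs are allowed under es2_normal; replace account_name with your account.
srun --partition=es2 --qos=es2_normal --account=account_name \
     --gres=gpu:H100:1 --cpus-per-task=14 --ntasks=1 --pty bash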
Example Slurm scripts on es2
Here is an example Slurm script that requests three NVIDIA H100 GPU cards.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=es2
#SBATCH --qos=es2_normal
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=14
#SBATCH --gres=gpu:H100:3
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
The following script requests one NVIDIA H200 GPU card.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=es2
#SBATCH --qos=es2_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --gres=gpu:H200:1
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
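To confirm which GPUs Slurm actually assigned, a couple of diagnostic lines can be added to the job script before the training command. This is a minimal sketch; CUDA_VISIBLE_DEVICES is only populated if the site's Slurm GPU configuration exports it.
# Hedged sketch: print the GPUs visible to this job before training starts.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"  # GPU indices assigned by Slurm, if exported
nvidia-smi -L                                        # list the allocated GPU cards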
es1 Partition¶
es1 is a partition consisting of multiple GPU node types to address different research needs. These include:
| Accelerator | Nodes | GPUs per Node / GPU Memory | CPU Processor | CPU Cores | CPU RAM | InfiniBand |
|---|---|---|---|---|---|---|
| NVIDIA A100 | 1 | 4x 80 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA A40 | 30 | 4x 48 GB | AMD EPYC 7742 | 64 | 512 GB | FDR |
| NVIDIA GRTX8000 | 1 | 4x 48 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA V100 | 15 | 2x 32 GB | Intel Xeon E5-2623 | 8 | 64 GB or 192 GB | FDR |
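Because es1 mixes several node types, it can help to check which GPU model each node carries before choosing a --gres value. sinfo can report the generic resources (GRES) per node; this is a sketch, and the exact output format depends on the site's Slurm configuration.
# Hedged sketch: list each es1 node with its GPU type and count.
sinfo -p es1 -N -o "%N %G"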
How to specify desired GPU card(s)¶
The normal qos for the es1 partition is called es_normal.
Due to the hardware configuration, special attention is needed to keep the ratio of CPU cores to GPUs correct: the A100, A40, and GRTX8000 nodes have 64 CPU cores and 4 GPUs (16 cores per GPU), while the V100 nodes have 8 CPU cores and 2 GPUs (4 cores per GPU).
Examples:
- Request one V100 card:
--cpus-per-task=4 --gres=gpu:V100:1 --ntasks=1
- Request two A40 cards:
--cpus-per-task=16 --gres=gpu:A40:2 --ntasks=2
- Request one A100 card:
--cpus-per-task=16 --gres=gpu:A100:1 --ntasks=1
- Request four GRTX8000 cards:
--cpus-per-task=16 --gres=gpu:GRTX8000:4 --ntasks=4
Example Slurm scripts on es1
Here is an example Slurm script that requests one NVIDIA A40 GPU card.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=es1
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:A40:1
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
The following script requests two NVIDIA V100 GPU cards.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=es1
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:V100:2
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
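Note that the two-GPU V100 job above runs plain python train.py; whether both GPUs are used depends entirely on the training code. If train.py is written for PyTorch distributed data parallel, it would typically be launched with torchrun instead, as in the sketch below, which assumes torchrun is available from the ml/pytorch module and that train.py initializes torch.distributed.
# Hedged sketch: single-node launch of a DDP-style train.py across the two allocated V100s.
# Assumes the ml/pytorch module provides torchrun and that train.py sets up torch.distributed.
torchrun --standalone --nproc_per_node=2 train.py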
es0 Partition¶
es0 is a partition with NVIDIA 2080 Ti GPUs; jobs on es0 do not incur Service Unit (SU) charges.
| Accelerator | Nodes | GPUs per Node / GPU Memory | CPU Processor | CPU Cores | CPU RAM | InfiniBand |
|---|---|---|---|---|---|---|
| NVIDIA 2080TI | 12 | 4x 11 GB | Intel Xeon Silver 4212 | 8 | 96 GB | FDR |
Example Slurm scripts on es0
Here is an example Slurm script that requests one GPU. Since es0 nodes contain only one GPU type, the type does not need to be specified in --gres.
#!/bin/bash
#SBATCH --job-name=testes0
#SBATCH --account=account_name
#SBATCH --partition=es0
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
The following script requests all four GPUs on an es0 node.
#!/bin/bash
#SBATCH --job-name=testes0
#SBATCH --account=account_name
#SBATCH --partition=es0
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:4
#SBATCH --time=1:00:00
module load ml/pytorch
python train.py
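After submitting any of the scripts above, the job's state and assigned node can be checked with squeue; a minimal sketch:
# Hedged sketch: show your pending and running jobs on the es0 partition.
squeue -u $USER -p es0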