# Science IT Technical Documentation > Science IT at LBNL technical documentation for HPC, Cloud and Data # Usage and Help # Science IT Technical Documentation - ### **Top Pages** ______________________________________________________________________ - [Getting Started with Lawrencium](hpc/getting-started/) - [Open OnDemand Portal](https://lrc-ondemand.lbl.gov/) - [MyLRC User Account Portal](https://mylrc.lbl.gov/) - [LRC Slurm Jobscript Generator](https://lbnl-science-it.github.io/lrc-jobscript/src/lrc-calculator.html) - ### **Systems Status** ______________________________________________________________________ - [HPC Service Announcements](https://it.lbl.gov/service/scienceit/high-performance-computing/status/) - [HPC Clusters Live Status - Warewulf Overview](hpc/status/) - ### **HPC Helpdesk** ______________________________________________________________________ - HPC Email Support: [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov) (Creates a ticket in AskUS) - AskUS Request Form: [HPC Help Request](https://lbl.servicenowservices.com/lbl/service_description.do?sysparm_svcdescid=b745a27cdb24360087de72840f9619cc) - Office Hours: [HPC & Science IT Office Hours: Every Wednesday 10:30-Noon (Zoom)](https://go.lbl.gov/scienceit-officehours-zoom) - ### **Science IT Consulting** ______________________________________________________________________ - Schedule a [Science IT Consulting Engagement](https://go.lbl.gov/scienceit) - Email Science IT: [scienceit@lbl.gov](mailto:scienceit@lbl.gov) - Office Hours: [HPC & Science IT Office Hours: Every Wednesday 10:30-Noon (Zoom)](https://go.lbl.gov/scienceit-officehours-zoom) - ### **Other Resources** ______________________________________________________________________ - [Science IT GitHub Repository](https://github.com/lbnl-science-it) - [LBNL IT Division Homepage](https://it.lbl.gov) - [LBNL Library E-Book Collection](https://commons.lbl.gov/display/rst/E-Books) - [Mac & PC Support](https://it.lbl.gov/group/it-support-services/workstation-support/) - [NERSC Technical Documentation](https://docs.nersc.gov) - [UC Berkeley Research IT Documentation](https://docs-research-it.berkeley.edu/) - ### **Can't find what you are looking for?** ______________________________________________________________________ - [Ask an LLM in the CBorg AI Portal](https://cborg.lbl.gov/). - Contact us at [scienceit@lbl.gov](mailto:scienceit@lbl.gov) to discuss your needs in scientific computing, research data management, cloud computing, AI/ML workflows, and more. # High Performance Computing **Lawrencium** is the platform for the [LBNL Condo Cluster Computing (LC3)](https://it.lbl.gov/service/scienceit/high-performance-computing/lrc/computing-on-lawrencium/condo-cluster-service/) program, which provides a sustainable way to meet the midrange computing requirements of Berkeley Lab. Lawrencium is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage, and backend network infrastructure. Unlike DOE computing user facilities such as NERSC, which offer leadership-tier performance but often have long wait times, Lawrencium provides midrange performance with short wait times. Berkeley Data Center Lawrencium is located at the Berkeley Data Center in Building 50B-1275. The datacenter is a 5,000 sq. ft. facility dedicated to Berkeley Lab's scientific computing resources such as Lawrencium. 
## Hardware Configuration Lawrencium is composed of multiple generations of hardware; it is therefore separated into several partitions to facilitate management and to meet the requirements of hosting Condo projects. The following table lists the hardware configuration for each individual partition. - [Lawrencium CPU Cluster](systems/lawrencium/) - [Einsteinium GPU Cluster](systems/einsteinium/) In addition, there are several **Supported Research Clusters**; more information on each of these can be found by selecting the desired supported cluster under `Computing Systems > Supported Research Clusters`. ## Storage and Backup Lawrencium cluster users are entitled to access the following storage systems, so please become familiar with them. | Name | Location | Quota | Backup | Allocation | Description | | --- | --- | --- | --- | --- | --- | | HOME | `/global/home/users/$user` | 30GB | Yes | Per User | Home directory for permanent data storage | | GROUP-SW | `/global/home/groups-sw/$group` | 200GB | Yes | Per Group | Group directory for software and data sharing with backup | | GROUP | `/global/home/groups/$group` | 400GB | No | Per Group | Group directory for data sharing without backup | | SCRATCH | `/global/scratch/users/$user` | None | No | Per User | Scratch directory with Lustre high performance parallel file system | | CLUSTERFS | `/clusterfs/axl/$USER` | None | No | Per User | Private storage for AXL condo | | CLUSTERFS | `/clusterfs/cumulus/$USER` | None | No | Per User | Private storage for CUMULUS condo | | CLUSTERFS | `/clusterfs/esd/$USER` | None | No | Per User | Private storage for ESD condo | | CLUSTERFS | `/clusterfs/geoseq/$USER` | None | No | Per User | Private storage for CO2SEQ condo | | CLUSTERFS | `/clusterfs/nokomis/$USER` | None | No | Per User | Private storage for NOKOMIS condo | ## Recharge Model LBNL has made a significant investment in developing this platform to meet the midrange computing requirements at Berkeley Lab. The primary purpose is to provide a sustainable way to host all the condo projects while meeting the computing requirements of other users. To achieve this goal, condo users are allowed to run within their condo contributions for free. However, normal users who would like to use the Lawrencium cluster are subject to the LBNL recharge rate. Condo users who need to run outside of their condo contributions are also subject to the same recharge rate as normal users. For this purpose, condo users will obtain either one or two projects/accounts when their accounts are created on Lawrencium, per the instructions we receive from the PI of the condo project. They need to provide the correct project when running jobs inside or outside of their condo contributions, which is explained in detail in the Scheduler Configuration section below. The current recharge rate is $0.01 per Service Unit (1 cent per service unit, SU). Due to hardware architecture differences, we discount the effective recharge rate for older generations of hardware. Please refer to the following tables for the current recharge rate for each partition. 
### CPU Partitions Recharge Rates | Partition | Shared or Exclusive | SU to Core CPU Hour Ratio | Effective Recharge Rate | | --- | --- | --- | --- | | lr4 | Exclusive | 0 | free | | lr5 | Exclusive | 0.50 | $0.0050 per Core CPU Hour | | lr6 | Exclusive | 0.75 | $0.0075 per Core CPU Hour | | lr7 | Shared | 1.0 | $0.01 per Core CPU Hour | | lr8 | Shared | 1.0 | $0.01 per Core CPU Hour | | lr_bigmem | Exclusive | 1.5 | $0.015 per Core CPU Hour | | cf1 | Exclusive | 0.4 | $0.004 per Core CPU Hour | | cm1 | Shared | 0.75 | $0.0075 per Core CPU Hour | | cm2 | Shared | 1.0 | $0.01 per Core CPU Hour | | es1 | Shared | 1.0 | $0.01 per Core CPU Hour | | ood_inter | Shared | 1.0 | $0.01 per Core CPU Hour | ### GPU Partitions Recharge Rates | Partition | Shared or Exclusive | SU to Core CPU Hour Ratio | Effective Recharge Rate | | --- | --- | --- | --- | | es0 | Shared | 0 | free | | es1 | Shared | 1.0 | $0.01 per Core CPU Hour | Usage Calculation The usage calculation is based on the resources that are allocated to the job rather than the job's actual usage. For example, if a job asked for one `lr5` node with one CPU (a typical serial job case) and ran for 24 hours, then, since **lr5** nodes are allocated **exclusively** to the job (please refer to the following Scheduler Configuration section for more detail), the charge that this job incurred would be: **$0.0050/(core * hour) * 1 node * 24 cores/node * 24 hours = $2.88** instead of: $0.0050/(core\*hour) * 1 core * 24 hours = $0.12. ## Scheduler Configuration The Lawrencium cluster uses [Slurm](running/slurm-overview/) as the scheduler to submit and manage jobs on the cluster. To use Lawrencium through Slurm, the partition (e.g., `lr4, lr5, lr6, es1, cm1, cm2`) must be specified (`--partition=xxx`) along with an account (`--account=xxx`). Currently the available QoS (Quality of Service) options are `lr_normal`, `lr_debug`, and `lr_lowprio`. A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced. - For normal users to use the Lawrencium resources, the proper project account is needed, e.g., `--account=ac_abc`. The appropriate QoS is also required based on the partition that the job is submitted to, e.g., `--qos=lr_normal`. - If a debug job is desired, the `lr_debug` QoS should be specified, e.g., `--qos=lr_debug`, so that the scheduler can adjust the job priority accordingly. - Condo users should use the proper condo QoS, e.g., `--qos=condo_xyz`, as well as the proper recharge account, e.g., `--account=lr_xyz`. - The partition name is always required in all cases, e.g., `--partition=lr6`. A minimal job-script header illustrating these options is sketched at the end of this section. Fair-share policy A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced. All accounts are given an equal shares value of 1. All users under each account associated with a partition are subject to a decay in priority based on the resources they have used and the overall parent account usage. Usage is a value between 0.0 and 1.0 that represents the proportional usage of the system. A value of 0 indicates that the association is over-served; in other words, that account has used its share of the resources and will be given a lower fair-share priority than users who have not used as many resources. - Job prioritization is based on Age, Fairshare, Partition, and QOS. Note: `lr_lowprio` QoS jobs are not given any prioritization, and some QoS values are higher than others. - If a node feature is not provided, the job will be dispatched to nodes based on a predefined order; for `lr5` the order is: `lr5_c28`, `lr5_c20`. 
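Putting the scheduler options above together, here is a minimal sketch of a job-script header. The account, partition, and QoS values (`ac_abc`, `lr6`, `lr_normal`) are placeholders taken from the examples above; replace them with the values valid for your own project (see your Slurm association).

```
#!/bin/bash
# Partition is always required
#SBATCH --partition=lr6
# Project account (condo users would use their condo account, e.g. lr_xyz)
#SBATCH --account=ac_abc
# QoS: lr_normal, lr_debug, lr_lowprio, or a condo QoS such as condo_xyz
#SBATCH --qos=lr_normal
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --time=01:00:00

## Command(s) to run:
./a.out
```

Condo users running within their contribution would swap in their condo QoS and account as described in the bullets above.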
# Acknowledgement Please acknowledge Lawrencium in your publications. A sample statement is: Sample Acknowledgement Statement This research used the Lawrencium computational cluster resource provided by the IT Division at the Lawrence Berkeley National Laboratory (Supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231) # Data Transfer To improve the data transfer experience of our supercluster, a separate dedicated data transfer server is available. `lrc-xfer.lbl.gov` mounts all the cluster file systems, so users can transfer data into/from any cluster filesystem. NERSC HPSS data transfer utilities such as `hsi` and `htar` are also configured to work on this server. ## Data Transfer Examples On Linux Transfer data from a local machine to Lawrencium ``` scp file-xxx $USER@lrc-xfer.lbl.gov:/global/home/users/$USER ``` ``` scp -r dir-xxx $USER@lrc-xfer.lbl.gov:/global/scratch/$USER ``` Transfer from Lawrencium to a local machine ``` scp $USER@lrc-xfer.lbl.gov:/global/scratch/$USER/file-xxx ~/Desktop ``` Transfer from Lawrencium to Another Institute ``` ssh $USER@lrc-xfer.lbl.gov # DTN ``` ``` scp -r $USER@lrc-xfer.lbl.gov:/file-on-lawrencium $USER@other-institute:/destination/path/ ``` ## Rsync: data transfer and backup tool ``` rsync -avpz file-at-local $USER@lrc-xfer.lbl.gov:/global/home/users/$USER ``` ## Data Transfer Examples On Windows - WinSCP: SFTP and FTP client for Microsoft Windows - FileZilla: multi-platform SFTP client # FAQs How do I get a user account on the Lawrencium cluster? Principal Investigators (PIs) can sponsor researchers, students, and external collaborators for cluster accounts. Account requests and approvals are done through the [MyLRC portal](https://mylrc.lbl.gov/). Either the PI or a user can place a user account creation request on the MyLRC portal. Please see the [MyLRC documentation](https://it.lbl.gov/service/scienceit/high-performance-computing/mylrc-lawrencium-account-management-system/) to learn how to submit a request. Upon request, an automatic email will be sent to your PI for approval. When the PI approves the request, it will be processed and the user will be notified by email when the account is available. How do I submit my first job? Log in to the cluster using any terminal option of your choice and the server name **`lrc-login.lbl.gov`**. Use your user name and PIN+OTP combination to log in. Upon login you will end up on one of the login nodes, in your home directory. Please do not run jobs on the login nodes. Instead, request a compute node using either an interactive or a batch Slurm session. You need to know your [slurm association](../running/slurm-overview/#slurm-association) before scheduling a Slurm session. Check out the Slurm job submission [examples here](../running/script-examples/), which cover different types of jobs such as CPU-only, GPU, MPI, and serial jobs. How do I transfer data to and from the cluster? For more details, please see the examples on the [Using the lrc-xfer DTN page](../data-transfer-node/). What is the maximum runtime / walltime you can assign a job? 
It depends on the QoS; the information can be obtained using the following command: ``` sacctmgr show qos name=lr_normal,lr_debug,lr_interactive,cm1_debug,cm1_normal,es_debug,es_normal,cf_debug,cf_normal,es_lowprio,cf_lowprio format=name,maxtres,maxwall,mintres ``` The maximum runtime / walltime is shown in the `MaxWall` column. The output may look like the following: ``` Name MaxTRES MaxWall MinTRES ---------- ------------- ----------- ------------- lr_debug node=4 03:00:00 cpu=1 lr_normal 3-00:00:00 cpu=1 cf_normal node=64 3-00:00:00 cpu=1 cf_debug node=4 01:00:00 cpu=1 es_normal node=64 3-00:00:00 cpu=2,gres/g+ es_debug node=4 03:00:00 cpu=2,gres/g+ cm1_debug node=4 01:00:00 cpu=1 cm1_normal node=64 3-00:00:00 cpu=1 es_lowprio cpu=2,gres/g+ cf_lowprio lr_intera+ cpu=32 3-00:00:00 ``` # Getting Started ## Project Accounts There are three primary ways/projects to obtain access to Lawrencium: - PI Computing Allowance (PCA): a free, annually renewable allowance of 500K SUs - Condo: purchase and contribute Condo nodes to the Lawrencium cluster - Recharge: charged at a minimal recharge rate of roughly $0.01/SU [How to get a Project Account on Lawrencium?](../accounts/project-accounts/) Project Group Directories Project group directories are not created by default. If you would like to create group directories where your group members can share data and software, please send a request to [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). ## User Accounts You must have a user account to gain access to the Lawrencium cluster. [How to request a User Account and Submit User Agreement?](../accounts/user-accounts/) ## Logging in You'll need to generate and enter a one-time password each time you log in. [Click here for more information on logging in to Lawrencium](../accounts/loggingin/) ## Data Movement and Storage To transfer data from other computers into - or out of - your various storage directories, you can use protocols and tools like SCP, SFTP, FTPS, and rsync. If you're transferring lots of data, the web-based [Globus Connect](../../data/globus-instructions/) tool is typically your best choice: it can perform fast, reliable, unattended transfers. The LRC supercluster's dedicated Data Transfer Node is `lrc-xfer.lbl.gov`. For more information on getting your data onto and off of Lawrencium, please see [Data Transfer](../data-transfer-node/). ## Software Module Farm and Environment Modules Many software packages and tools are already built and provided for your use through the Software Module Farm. These packages can be loaded and unloaded via Environment Modules commands (a short example is shown at the end of this section). For details see [Software Module Farm](../software/software-module-farm/) and [Module Management](../software/module-management/). ## Running Jobs When you log into a cluster, you'll land on one of several login nodes. Here you can edit scripts, compile programs, etc. However, you should not run any applications or tasks on the login nodes, which are shared with other cluster users. Instead, use the SLURM job scheduler to submit jobs that will be run on one or more of the cluster's many compute nodes. For details see [Slurm Overview](../running/slurm-overview/) and [Example Scripts](../running/script-examples/). ## Open OnDemand We provide interactive apps, such as Jupyter notebooks, RStudio, and MATLAB, through the browser-based Open OnDemand service at [https://lrc-ondemand.lbl.gov](https://lrc-ondemand.lbl.gov/). Use your LRC username and PIN+one-time password (OTP). 
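As a quick illustration of the Environment Modules commands mentioned under Software Module Farm above (the `gcc` module name is only an example; run `module avail` to see what is actually installed on Lawrencium):

```
# see every module the Software Module Farm provides
module avail

# load a module and confirm it is in your environment
module load gcc
module list

# unload a single module, or clear all loaded modules
module unload gcc
module purge
```

The same `module load` lines can be placed near the top of a Slurm job script so that batch jobs see the same software environment as your interactive shell.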
# HPC Clusters @ Berkeley Lab - Live Status # Logging In Multi-Factor Authentication (MFA) Please make sure you have configured [Multi-Factor Authentication (MFA)](../mfa/) before logging in for the first time. You'll need to generate and enter a one-time password each time that you log in. You'll use an application called Google Authenticator to generate these passwords, which you can install and run on your smartphone and/or tablet. For instructions on setting up and using Google Authenticator, see Multi-Factor Authentication. Once you have your PIN+OTP set up, you can log in to the cluster using an SSH client of your choice or a Linux/Mac terminal: ``` ssh username@lrc-login.lbl.gov ``` You will be prompted to enter your password. Enter your PIN+OTP without any spaces. For example, if your PIN is `0123` and your OTP is `456789`, then you will type it as `0123456789`. Note that the characters won't appear on the screen. Running jobs The login nodes should not be used for running jobs. They should only be used to write scripts and submit jobs to the compute nodes. More details on writing job scripts and submitting them can be found [here](../../running/script-examples/). # Multi-Factor Authentication (MFA) Link to Token Management Page Visit the [Token Management web page](https://identity.lbl.gov/otptokens/login) to manage your MFA; detailed instructions are given below. ## Introduction All users are required to use Multi-Factor Authentication (MFA) for logging into IT HPC resources such as the Lawrencium cluster and other scientific computing clusters managed by HPCS. MFA provides greater protection than regular passwords against phishing and other modern threats to your digital security. With MFA, you authenticate using your password plus a "one-time password" (OTP). As the name implies, you can use an OTP only once. All users are required to install and use an authenticator app on their smartphones and configure it to generate OTPs. There are many such apps; some popular ones that are known to work are Google Authenticator (GA), Microsoft Authenticator, and Authy. (Note: Duo is supposed to work, but at least two users have run into time-sync problems between the Duo implementation and the LBL RADIUS server, so at this time Duo is NOT recommended.) There are also desktop apps, but they somewhat negate the advantage of MFA being "something you have with you". However, if you need a desktop app, you can try the Authy desktop app by using the instructions in [this link](https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/using-authy/). YubiKey Lastly, it is also possible to use a [YubiKey](https://www.yubico.com/products/yubikey-5-overview/) as your MFA. This has a significant setup cost and requires in-person verification; if you are interested in this, please email HPCS support at hpcshelp@lbl.gov for additional assistance. ## MFA Instructions ### **Step 1**: Download and install Google Authenticator on a mobile device. - In the Google Play store or iOS App Store on your smartphone or tablet, search for and install an authenticator app such as Google Authenticator (GA), Microsoft Authenticator, or Authy. ### **Step 2**: Visit and Login to [OTP Token Management Interface](https://identity.lbl.gov/otptokens/login) - Berkeley Lab users can access the interface by clicking ‘Berkeley Lab Login’ in the top section. 
- External Users (non-Berkeley Lab users) should have linked one of their personal accounts (Facebook, Google, or UC Berkeley) with their LRC HPC Cluster account; they can access the token management interface using that linked personal account's credentials. - External users must link their account to the email address used while requesting an account; a linking email is sent to you for this purpose. If you haven't received the email or the link has expired, you can request a linking email using the MyLRC portal. To do so, please go to your profile on the upper right-hand side. At the bottom of your profile page, you will see a button Request Linking Email. Please click on it, and you will get an email within half an hour. If you don't get the email, please get in touch with us at [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). ### **Step 3**: After login, create an HPC Cluster/Linux Workstation Token - All HPCS managed clusters and Linux workstations use the HPC Cluster/Linux Workstation token shown in the second section of the ‘Token Management’ page (toward the bottom). - Click on the ‘Add Token’ link and follow the instructions. Important Remember the PIN that you are setting; you will use it every time you access the resource. PINs can be changed or updated later from this same interface. - After you've successfully created your new token, a QR code for that token will be displayed. ### **Step 4** Scan the 2-D QR code - Back on your smartphone or tablet, from the menu of the Google Authenticator app, select ‘Add an account’ and then ‘Scan a barcode’. This will store the token in the GA app, and it is now ready to generate One Time Passwords. Note If your device does not already have a QR code reader app installed, the Google Authenticator app may first lead you through the process of installing one. Important When you access the resource, remember to type the token PIN first, followed by the OTP from the GA app, at the password prompt. For instance, if your PIN was 9999 (hint: don't use this example as your own PIN!), and the one-time password currently displayed by Google Authenticator was 123456, you'd enter the following at the Password prompt: Password: 9999123456 ## Troubleshooting If you've already set up your token but are unable to log into the cluster successfully, here's what to try: Tip 1 Make sure you're including the PIN as part of your password At the Password: prompt, make sure that you're entering your token PIN, followed immediately by the 6-digit one-time password from Google Authenticator. Tip 2 Wait to enter the one-time password until a new one has just been displayed If the ‘countdown clock’ indicator in the Google Authenticator app is nearing its end, signifying that the existing password is about to expire, try waiting until a new one-time password has been displayed. Then enter that new password, immediately after your PIN, at the Password: prompt. Tip 3 Check that, in your SSH command or in the configuration for your SSH application, you're using your correct login name (i.e., your Linux user name) on the cluster Tip 4 Check that, in your SSH command or in the configuration for your SSH application, you're using the correct hostname for the cluster's front-end/login nodes, lrc-login.lbl.gov, or for its Data Transfer Node, lrc-xfer.lbl.gov Tip 5 Test (and if needed, reset) your token or its PIN Visit the [Token Management web page](https://identity.lbl.gov/otptokens/) to log in to the Token Management page. A list of one or more tokens should then be displayed. 
From this list, find your relevant token: the one that you entered into Google Authenticator on the smartphone or tablet you're currently using. (If you want to check this further, the "TOTP number" that appears in the box for your token, on the Token Management web page, should match the TOTP number in Google Authenticator's window on your device.) If there's only a "Reset" option in that token's box, click that link. Then proceed to the next step, below. If there's a "Test" option, click that link, then enter your PIN followed immediately by your Google Authenticator 6-digit one-time password, and click the "Test Now" button. If your test(s) fail, click "Done". Then click the "Reset PIN" link and reset your PIN. (You can even 'reset' it to your current PIN.) Try the "Test" option once again. Once you get a successful test of your PIN plus one-time password on this web page, you can try logging into the cluster once again and see if you're successful there, as well. Tip 6 Try creating a brand new token and adding it to Google Authenticator, as described in the instructions above. (Before or after doing this, you can delete your existing token, both on the LBL Token Management web page and in the Google Authenticator app on your device, to avoid any confusion with the new token.) Tip 7 If none of the above tips gives you a clue about what is not working, try to SSH to LRC resources from a different IP address, i.e., from a different computer or laptop. If that works, email the IP address from which it is not working to LRC support at [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). # Project Accounts The Lawrencium cluster is open to all Berkeley Lab researchers needing access to high performance computing. Research collaborations are also welcome provided that there is an LBNL PI. LBNL PIs wanting to obtain access to Lawrencium for their research project will need to complete the project request at the [myLRC portal](https://mylrc.lbl.gov/), giving the details of the research activity along with a list of anticipated users. A unique group name will be created for the project and associated users. This group name will be used to set up allocations and report usage. There are three primary ways to obtain access to Lawrencium: 1. **LBNL PIs**: requesting a block of no-cost computing time via a PI Computing Allowance (PCA). This option is currently offered to all eligible Berkeley Lab PIs. For additional details, please see [PI Computing Allowance](https://it.lbl.gov/service/scienceit/high-performance-computing/lrc/computing-on-lawrencium/pi-computing-allowance/). 1. **Condo projects**: purchasing and contributing Condo nodes to the cluster. This option is open to any Berkeley Lab staff, and provides ongoing, priority access to you and your research affiliates who are members of the Condo. For details, please see [Condo Cluster Service](https://it.lbl.gov/service/scienceit/high-performance-computing/lrc/computing-on-lawrencium/condo-cluster-service/). 1. **Recharge use**: Berkeley Lab researchers who want to use the Lawrencium cluster at a minimal recharge rate, roughly $0.01/SU. For details, please see Recharge Allocation Computing Allowance. To request a PCA, Condo, or Recharge project on Lawrencium, please submit your request at the [myLRC portal](https://mylrc.lbl.gov/). Make sure to choose the desired project type on the form. 
### Changing Your ProjectID and Valid ProjectID If the projectID associated with your project accounts on Lawrencium expires or becomes invalid, you can request a change of projectID by sending us an email at [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). To maintain your Lawrencium user account, you must have a valid PID. If your PID becomes invalid or you lose access to projects, for example, if you leave the last project or a PI removes you from it, your account will be temporarily blocked from logging into the cluster until a valid PID is supplied. ### Allocations **Computer Time**: We are currently not using an allocation process to allocate compute time to individual projects. Instead, usage and priority will be regulated by a scheduler policy intended to provide a level of fairness across users. If demand exceeds supply, a committee consisting of scientific division representatives will review the need for allocations. **Cost**: There is a nominal charge of $25/mo/user for the use of Lawrencium to cover the costs of home directory storage and backups. PCA project accounts are not charged for usage. Recharge accounts are charged $0.01/SU for compute. Account fees and CPU usage will appear as LRCACT and LRCCPU in the LBL Cost Browser. **Storage**: Home directory space will have a quota set at 20GB per user. Users may also use the `/clusterfs/lawrencium` shared filesystem, which does not have a quota; this file system is intended for short-term use and should be considered volatile. Backups are not performed on this file system. Data is subject to a periodic purge policy wherein any files which have not been accessed within the last 14 days will be deleted. Users should make sure to back up these files to some external permanent storage as soon as they are generated on the cluster. **Lustre**: A Lustre parallel file system is also available to Lawrencium cluster users. The file system is built with 4 OSS and 15 OST servers and has a capacity of 1.8PB. The default striping is set to 4 OSTs with a stripe size of 1 MB. All Lawrencium cluster users will receive a directory created under `/clusterfs/lawrencium` with the above default stripe values set. This is a scratch file system, so it is mainly intended for storing large input or output files for running jobs and for all the parallel I/O needs on the Lawrencium cluster. This file system is intended for short-term use and should be considered volatile. Backups are not performed on this file system. Data in scratch is subject to a periodic purge policy wherein any files which have not been accessed within the last 14 days will be deleted. Users should make sure to back up these files to some external permanent storage as soon as they are generated on the cluster. ### Acknowledgements Please acknowledge Lawrencium in your publications. A sample statement is: Sample Acknowledgement Statement This research used the Lawrencium computational cluster resource provided by the IT Division at the Lawrence Berkeley National Laboratory (Supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231) # User Accounts ### How to get a user account on Lawrencium / How to add additional users to your project? Accounts are added on a first-come, first-served basis upon approval of the PI for the project. For security reasons, access to Lawrencium will be through the use of one-time password tokens. 
Users will be required to complete user account requests at the [myLRC portal](https://mylrc.lbl.gov/) in order to get their one-time password token generator and their account. Closing User Accounts The PI for the project or the main contact is responsible for notifying HPCS to close user accounts and for the disposition of the user's software, files, and data. In some cases, users share software and data from their home directory and others may depend on them. For this reason, account terminations are only performed when requested by the PI, the main contact, or the user of the account. User accounts are not automatically deactivated upon termination of an employee, because many people change their employment status but remain engaged with the project. Send your requests via the [myLRC portal](https://mylrc.lbl.gov/). Questions regarding requesting new accounts or removing accounts can be directed to [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). # Jupyter Server The Jupyter Notebook is a web application that enables you to create and share documents that can contain a mix of live code, equations, visualizations, and explanatory text. This is an introduction to using Jupyter notebooks on Lawrencium. Before getting started, make sure you have [access to the Lawrencium cluster](../../accounts/project-accounts/). As described next, you can start a Jupyter notebook via the Open OnDemand service, which allows you to operate completely via your web browser on your local computer (e.g., your laptop). ## Jupyter notebooks on Open OnDemand ### Running a notebook 1. Connect to [https://lrc-ondemand.lbl.gov](https://lrc-ondemand.lbl.gov/). 1. After [logging in](../overview/), you will get to the Open OnDemand welcome screen. Click the **Interactive Apps** pulldown. 1. Choose the **Jupyter Server** option from the list of apps. Choose the **Jupyter Server - interactive for exploration/debugging** only if you are writing/debugging code and not doing any computationally intensive tasks. 1. Fill out the form presented to you and click on **Launch**. An example of filling out this form is shown in the next section. 1. Once the server is ready, you will be able to click on the **Connect to Jupyter** button to get a Jupyter notebook. ### Example: Launch a Jupyter Server on the ES1 GPU partition 1. Select the following parameters to launch a Jupyter server on one GPU card of an A40 GPU node using the normal priority queue (with 16 CPU cores): - SLURM Partition: **es1** - Name of SLURM Quality of Service (QoS): **es_normal** - Number of nodes: **1** - Select GPU Type from dropdown: **es1: NVidia A40 (40 GB) 1-4x** - Select number of GPU cards to use: **1** - Number of CPU cores per Node: **16** Please also choose or enter the **SLURM Project/Account Name**, the **Wall Clock Time**, and the **Name of the job** according to your needs. Example form for choosing an A40 GPU card 1. Upon clicking **Launch**, you may have to wait for the requested resource to be allocated. 1. When the server is ready, click on the **Connect to Jupyter** button to open your Jupyter server session. 1. After clicking on **Connect to Jupyter**, you will enter the classic Jupyter or JupyterLab environment. - Under File > New > Notebook, you will find several Jupyter kernels with different Python versions and packages that you can choose according to your requirements. These include: - Python 3 (ipykernel) - python through `anaconda3/2024.02-1-11.4` module - torch 2.3.1 py3.11.7 - PyTorch 2.3.1 through `ml/pytorch/2.3.1-py3.11.7` module - tf 2.15.0 py3.10.0 - TensorFlow 2.15.0 through `ml/tensorflow/2.15.0-py3.10.0` module 1. 
You can have your session continue to operate in the background by selecting the **Logout** button (upper right-hand corner) on Open OnDemand. 1. To terminate a running notebook, select the **My Interactive Sessions** tab on the Open OnDemand menu and click on **Delete**. Further information about working with Jupyter notebooks can be found in the [Jupyter Documentation](https://docs.jupyter.org/en/latest/) and the [JupyterLab Documentation](https://jupyterlab.readthedocs.io/en/latest/). # Using Ollama with Jupyter and VS Code The **Ollama - JupyterAI & VS Code Continue** app can be used on LRC Open OnDemand for running LLMs locally on Lawrencium compute resources. This can be useful for prototyping applications that make use of LLMs, or for general experimentation. To use the app, take the following steps: 1. After clicking on the app in the **Interactive Apps** menu of LRC Open OnDemand, fill out the form with your requirements. Below is an example that will request one V100 GPU for 3 hours. GPU nodes (`es0` or `es1` partition) are recommended for running LLMs. Choose the partition, GPU type, and number of GPUs according to your needs. Example form for choosing one V100 GPU card on the es1 partition 1. Click **Launch**. Upon clicking **Launch**, you may have to wait for the requested resource to be allocated. 1. When the server is ready, you will get two buttons: **Connect to Jupyter** and **Connect to VS Code**, as shown in the image below. ## Ollama on Jupyter If you click on **Connect to Jupyter**, you will get a JupyterLab instance with the [Jupyter AI](https://jupyter-ai.readthedocs.io/) extension. To chat using the default Ollama model, click on the Jupyter AI chat interface on the left side of the JupyterLab workspace. To change models or settings, click on the settings icon of the Jupyter AI interface in the top right corner. Jupyter AI Interface ### Changing the model on Jupyter AI To change the model, you will need to type in the model name from the list of currently available models; for example: `devstral:24b`, `gemma3:12b`. A complete list can be obtained by using the `ollama list` command in a terminal (File > New > Terminal). `ollama list` ``` [user@hostname ~]$ ollama list NAME ID SIZE MODIFIED devstral:24b c4b2fa0c33d7 14 GB 6 days ago codegemma:2b 926331004170 1.6 GB 11 days ago nomic-embed-text:v1.5 0a109f422b47 274 MB 4 weeks ago deepseek-coder:6.7b ce298d984115 3.8 GB 4 weeks ago deepseek-coder:1.3b 3ddd2d3fc8d2 776 MB 4 weeks ago llama3.2:1b baf6a787fdff 1.3 GB 4 weeks ago qwen3:1.7b 458ce03a2187 1.4 GB 4 weeks ago qwen3:30b-a3b 2ee832bc15b5 18 GB 4 weeks ago qwen3:8b e4b5fd7f8af0 5.2 GB 4 weeks ago deepseek-r1:8b 28f8fd6cdc67 4.9 GB 4 weeks ago deepseek-r1:7b 0a8c26691023 4.7 GB 4 weeks ago deepseek-r1:1.5b a42b25d8c10a 1.1 GB 4 weeks ago gemma3:4b a2af6cc3eb7f 3.3 GB 4 weeks ago gemma3:12b f4031aab637d 8.1 GB 4 weeks ago gemma3:12b-it-qat 5d4fa005e7bb 8.9 GB 4 weeks ago gemma3:1b 8648f39daa8f 815 MB 4 weeks ago ``` ### Using the `ollama` Python library in Jupyter notebooks You can use the [`ollama` Python](https://github.com/ollama/ollama-python) module to interact with Ollama in a notebook using the default `Python 3 (ipykernel)` kernel. 
For example: `ollama-python` example ``` import os from ollama import Client client = Client(host=os.environ["OLLAMA_HOST"]) response = client.chat(model='llama3.2:1b', messages=[{'role': 'user', 'content': 'Hello'}]) print(response['message']['content']) ``` ## Ollama on VS Code If you click on **Connect to VS Code**, you will get a VS Code server instance with the [Continue](https://marketplace.visualstudio.com/items?itemName=Continue.continue) extension. You can use the Continue Chat feature by clicking on the Continue button on the left side of the VS Code workspace. VS Code Continue Interface # Open OnDemand Overview We provide various interactive apps through the browser-based Open OnDemand service available at [https://lrc-ondemand.lbl.gov](https://lrc-ondemand.lbl.gov/). The available apps/services include: - Jupyter notebooks - Ollama - RStudio - MATLAB - VS Code - File browsing - Slurm job listing - Terminal/shell access (under the "Clusters" tab) ## Logging In 1. Visit [https://lrc-ondemand.lbl.gov](https://lrc-ondemand.lbl.gov/) in your web browser. 1. Log in using CILogon. At the login page, please select the appropriate institution. If you have a Berkeley Lab identity, please select Lawrence Berkeley National Laboratory and use your Berkeley Lab Identity to log in to Open OnDemand. ## Service Unit Charges Open OnDemand apps may launch Slurm jobs on your behalf when you request sessions on a Slurm partition. Open OnDemand refers to these jobs as *"interactive sessions."* Since these are just Slurm jobs, service units are charged for interactive sessions the same way normal jobs are charged. Interactive, for exploration/debugging mode Sessions can be run on `.ood0` nodes by choosing the `interactive, for exploration/debugging` versions of the apps. Nodes ending in `.ood0` are shared nodes that are provided for low-intensity jobs. These should be treated like login nodes (that is, intensive computation is not allowed). Interactive sessions running on `.ood0` nodes are charged at 1 SU per CPU-hour. Job time is counted for interactive sessions as the total time the job runs. The job starts running as soon as a node is allocated for the job. *The interactive session may still be running even if you do not have it open in your web browser.* You can view all currently running interactive sessions under My Interactive Sessions. When you are done, you may stop an interactive session by clicking "Delete" on the session. There are several ways to monitor usage: - Since Open OnDemand submits jobs through Slurm, you can [monitor usage as you would monitor your regular Slurm Jobs](../../running/monitor-jobs/). - View currently running (and recent) sessions launched by Open OnDemand under `My Interactive Sessions`. - View all currently running jobs under `Jobs > Active Jobs`. ## Using Open OnDemand Here are some of the services provided via Open OnDemand. Services on Open OnDemand Access the Files App from the top menu bar under *Files > Home Directory*. Using the Files App, you can use your web browser to: - View files in the Lawrencium filesystem. - Create and delete files and directories. - Upload and download files from the Lawrencium filesystem to your computer. - We recommend using Globus for large file transfers. View and cancel active Slurm jobs from *Jobs > Active Jobs*. This includes jobs started via `sbatch` and `srun` as well as jobs started via Open OnDemand. Open OnDemand allows Lawrencium shell access from the top menu bar under *Clusters > LRC Shell Access*. ## Interactive Apps Additionally, Open OnDemand provides the following interactive apps. 
- Desktop App - Jupyter Server - MATLAB - RStudio Server - VS Code Server Click on a tab below to learn more about these interactive apps. Interactive Apps on Open OnDemand The Desktop App allows you to launch an interactive desktop on the Lawrencium cluster. You will be able to launch GUI applications directly on the desktop. **Steps:** - Select *Desktop* from the *Interactive Apps* menu. - Provide the job specifications you want for the Desktop app. - Once the Desktop is ready, click *Launch Desktop* and the Desktop will open in a new tab. See the [Jupyter documentation page](../jupyter-server/) for instructions on using Jupyter notebooks via Open OnDemand. This service replaces the JupyterHub service that we formerly provided. **Steps:** - Select *Jupyter Server* from the *Interactive Apps* menu. - Provide the job specifications you want for the Jupyter server. - Once Jupyter is ready, click *Connect to Jupyter* to access your Jupyter session. The MATLAB app allows you to use the [MATLAB](https://www.mathworks.com/products/matlab.html) GUI on the Lawrencium cluster. **Steps:** - Select *MATLAB* from the *Interactive Apps* menu. - Specify the amount of time you would like the MATLAB session to run. - Once the MATLAB session is ready, click *Launch MATLAB* to access the MATLAB GUI. The RStudio server allows you to use [RStudio](https://www.rstudio.com/) on the Lawrencium cluster. **Steps:** - Select *RStudio Server* from the *Interactive Apps* menu. - Provide the job specification you want for the RStudio server. - Once RStudio is ready, click *Connect to RStudio* to access RStudio. The VS Code server allows you to use [VS Code](https://code.visualstudio.com/) on the Lawrencium cluster. **Steps:** - Select *VS Code Server* from the *Interactive Apps* menu. - Provide the job specification you want for the VS Code server. - Once the VS Code Server is ready, click *Connect to VS Code* to access VS Code. Job run time Service units are charged based on job run time. The job may still be running if you close the window or log out. When you are done, shut down an interactive app by clicking *"Delete"* on the session under *My Interactive Sessions*. ## Troubleshooting Open OnDemand ### Common problems Problem: Open OnDemand login pop-up box keeps reappearing If you have trouble logging into OOD (including if the login pop-up box keeps reappearing after you enter your username and password), you may need to make sure you have completely exited out of other OOD sessions. This could include closing browser tab(s)/window(s), clearing your browser cache, and clearing relevant cookies. You might also try running OOD in an incognito window (or, if using Google Chrome, in a new user profile). ### General information for troubleshooting Logs and scripts for each interactive session with Open OnDemand are stored in: ``` ~/ondemand/data/sys/dashboard/batch_connect/sys ``` There are directories for each interactive app type within this directory. For example, to see the scripts and logs for a Jupyter session, you might look at the files under: ``` ~/ondemand/data/sys/dashboard/batch_connect/sys/lrc_jupyter/output/da19101d-70b0-43c1-84ff-7d9f0e739419 ``` # Adding Packages and Kernels ## Installing Python Packages A variety of standard Python packages (such as numpy, scipy, matplotlib, and pandas) are available automatically in the `anaconda3` module. To see what packages are available, open a Terminal in the [Jupyter server](../jupyter-server/) or open a Terminal on Lawrencium in the usual fashion. 
Then load the `anaconda3` module and list the installed packages: ``` module load anaconda3 conda list ``` pip vs python -m pip To reduce potential environment mismatches (especially in the presence of multiple Python installations), it is recommended to use `python -m pip` rather than `pip`. You can use `pip` to install or upgrade packages and then use them in a Jupyter notebook, but you will need to make sure to install the new versions or additional packages in your `home` or `scratch` directories because you do not have write permissions to the module directories. You can use ``` python -m pip install --user $PACKAGENAME ``` to install a Python package to `$HOME/.local`. So, if you need to install additional packages, you can load the desired Python module and then use `pip` to install in your `home` directory. For example, you can install the `cupy` package with: ``` module load anaconda3/2024.02 module load gcc/11.4.0 module load cuda/12.2.1 python -m pip install --user --no-cache-dir cupy-cuda12x ``` Packages installed in this manner in your `$HOME/.local/lib/python3.xx` will be available to the **Python 3 (ipykernel)** Jupyter kernel provided through the `anaconda3/2024.02` module. You can also install packages in a virtual environment or a conda environment and create a kernel associated with that environment. See the examples in the next sections. ## Adding New Kernels Jupyter supports notebooks in dozens of languages, including Python, R, and Julia. Not all of these languages or packages are supported by default in our Open OnDemand Jupyter server. The ability to create custom kernels is useful if you need to create your own kernel for a language that is not supported by default or if you want to customize the environment, for example to create a Jupyter kernel for a virtual environment or a conda environment. To list the available Jupyter kernels: ``` jupyter kernelspec list ``` Example: Add a kernel for a virtual environment As an example, let us create a virtual environment in our `$SCRATCH` directory called `cudapython` that installs the [CUDA Python](https://nvidia.github.io/cuda-python/latest/) packages. ``` module load anaconda3/2024.02 module load gcc/11.4.0 cuda/12.2.1 cd $SCRATCH python -m venv ./cudapython source cudapython/bin/activate python -m pip install -U cuda-python cuda-parallel cuda-cooperative python -m pip install -U cuda-core numba-cuda nvmath-python ipykernel python -m ipykernel install --user --name cudapython \ --display-name "CUDA Python" \ --env PATH $PATH \ --env LD_LIBRARY_PATH $LD_LIBRARY_PATH ``` The above command will create a `kernel.json` file in `~/.local/share/jupyter/kernels/cudapython`. You can manually edit this file to adjust the paths and environment variables. The new kernel will show up in the Jupyter server app on Open OnDemand. Depending on your packages and whether you had to load additional modules before installing the package, you may not have to pass the `--env PATH` and `--env LD_LIBRARY_PATH` values in the command above. In this example, the exported `PATH` and `LD_LIBRARY_PATH` variables are important because the CUDA Python packages make use of the `cuda/12.2.1` module that we loaded. ### Manually creating a new kernel To add a new kernel to your Jupyter environment, you can also manually create a subdirectory within `$HOME/.local/share/jupyter/kernels`. Within the subdirectory, you'll need a configuration file, `kernel.json`. Each new kernel should have its own subdirectory containing a configuration file. 
As an example, below is the content of the `~/.local/share/jupyter/kernels/cudapython/kernel.json` file that we just created using the `python -m ipykernel install` command in the previous section. You can create and/or edit this file as needed. Note that below we have used $SCRATCH instead of the actual path, but you will need to provide the full path to your Python executable. ``` { "argv": [ "$SCRATCH/cudapython/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}" ], "display_name": "CUDA Python", "language": "python", "metadata": { "debugger": true }, "env": { "PATH": "/location/to/path1:/location/to/path2", "LD_LIBRARY_PATH": "/location/to/path1:/location/to/path2" } } ``` Managing kernels for Jupyter Please review the Jupyter documentation on [Managing kernels for Jupyter](https://jupyter-client.readthedocs.io/en/latest/kernels.html) for more details regarding the format and contents of this configuration file. In particular, please make sure `$PATH, $LD_LIBRARY_PATH, $PYTHONPATH`, and all other environment variables that you use in the kernel are properly populated with the correct values. ### Using a conda environment Another approach to adding a new (Python) kernel to your Jupyter environment is to create a conda environment and add it as a kernel to Jupyter. When in Jupyter, you will then be able to select the name from the kernel list, and it will be using the packages you installed. Follow these steps to do this (replacing $ENV_NAME with the name you want to give your conda environment): ``` module load anaconda3 conda create --name=$ENV_NAME ipykernel conda activate $ENV_NAME python -m ipykernel install --user --name $ENV_NAME ``` For example, below we create a custom Python kernel in a conda environment that uses `python=3.12` and installs `numpy=2.0.0` from the `conda-forge` channel. ``` module load anaconda3/2024.02 conda create --name=numpy2test python=3.12 ipykernel conda activate numpy2test conda config --env --add channels conda-forge conda install numpy=2.0.0 python -m ipykernel install --user --name numpy2 --display-name="Numpy v2 (Python 3.12)" ``` Now you can choose the kernel you just created from the kernel list in your Jupyter environment on Open OnDemand. ### Using apptainer images It is also possible to create custom kernels using container images. For example, if you would like to use the NVIDIA RAPIDS docker image, you first need to convert it to an apptainer `sif` file. This can be done using the `apptainer pull` command in your `scratch` directory as: ``` export APPTAINER_CACHEDIR=$SCRATCH export APPTAINER_TMPDIR=$SCRATCH cd $SCRATCH apptainer pull docker://nvcr.io/nvidia/rapidsai/notebooks:25.04-cuda12.8-py3.12 ``` You can then manually create a `kernel.json` file in `~/.local/share/jupyter/kernels/rapids` with the following content. ``` { "argv": [ "apptainer", "exec", "--nv", "/path/to/notebooks_25.04-cuda12.8-py3.12.sif", "python", "-m", "ipykernel_launcher", "-f", "{connection_file}" ], "display_name": "RAPIDS 25.04 Kernel", "language": "python", "metadata": { "debugger": true } } ``` # GNU Parallel GNU Parallel is a shell tool for executing jobs in parallel on one or more computers. It's a helpful tool for automating the parallelization of multiple (often serial) jobs, in particular allowing one to group jobs into a single SLURM submission to take advantage of the multiple cores on a given Lawrencium node. A job can be a single-core serial task, a multi-core application, or an MPI application. A job can also be a command that reads from a pipe. 
The typical input is a list of parameters required for each task. GNU parallel can then split the input and pipe it into commands in parallel. GNU parallel makes sure the output from the commands is the same output as you would get had you run the commands sequentially, and output names can be easily correlated to input file names for easy post-processing. This makes it possible to use the output from GNU parallel as input for other programs. Below we'll show basic usage of GNU parallel and then provide an extended example illustrating submission of a job that uses GNU parallel. For full documentation see the [GNU parallel man page](https://www.gnu.org/software/parallel/man.html) and [GNU parallel tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html). Loading GNU parallel on Lawrencium GNU Parallel is available as a module on Lawrencium. To load GNU Parallel: ``` module load parallel ``` ## Basic Usage To motivate usage of GNU parallel, consider how you might automate running multiple individual tasks using a simple bash for loop. In this case, our example command involves copying a file. We will copy `file1.in` to `file1.out`, `file2.in` to `file2.out`, etc. ``` for (( i=1; i <= 3; i++ )); do cp file${i}.in file${i}.out done ``` That's fine, but it won't run the tasks in parallel. Let's use GNU parallel to do it in parallel: ``` parallel -j 2 cp file{}.in file{}.out ::: 1 2 3 ls file*out # file1.out file2.out file3.out ``` Based on `-j`, that will use two cores to process the three tasks, starting the third task when a core becomes free from having finished either the first or second task. The `:::` syntax separates the input values `1 2 3` from the command being run. Each input value is substituted for `{}` and the `cp` command is run. ### Extended example Here we'll put it all together (and include even more useful syntax) to parallelize use of the bioinformatics software BLAST across multiple biological input sequences. Below are three sample files, including a SLURM job submission script where GNU parallel launches parallel tasks, a bash file to run a serial task, and a task list. 1. Job submission script: e.g., blast.slurm, where GNU parallel flags are set up and independent tasks are launched in parallel: ``` #!/bin/bash #SBATCH --job-name=job-name #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=2 #SBATCH --time=2:00:00 ## Command(s) to run (example): # module load bio/blast-plus parallel export WDIR=/your/desired/path cd $WDIR export JOBS_PER_NODE=$SLURM_CPUS_ON_NODE # ## when each task is multi-threaded, say NTHREADS=2, then JOBS_PER_NODE should be revised as below ## JOBS_PER_NODE=$(( $SLURM_CPUS_ON_NODE / NTHREADS )) ## when the memory footprint is large, JOBS_PER_NODE needs to be set to a value less than $SLURM_CPUS_ON_NODE # echo $SLURM_JOB_NODELIST |sed s/\,/\\n/g > hostfile ## when GNU parallel can't detect the core count of remote nodes, say with --progress/--eta, ## the core count should be prepended to hostnames, e.g. 32/n0149.savio3: ## echo $SLURM_JOB_NODELIST |sed s/\,/\\n/g |awk -v cores=$SLURM_CPUS_ON_NODE '{print cores"/"$1}' > hostfile # parallel --jobs $JOBS_PER_NODE --slf hostfile --wd $WDIR --joblog task.log --resume --progress \ --colsep ' ' -a task.lst sh run-blast.sh {} output/{/.}.blst ``` - `-a`: task list as input to GNU parallel - `--sshloginfile/--slf`: compute node list - `{}`: take values from the task list, one line at a time, as parameters to the application/serial task (e.g. 
run-blast.sh) - `{/.}`: remove the path and file extension - `output/{/.}`: specify output/file.blst as the BLAST result - `--colsep`: column separator, such as comma, tab, or space, for the input task list - `--jobs`: number of tasks per node - `--wd`: landing work dir on compute nodes; the default is $HOME - `--joblog`: keep track of completed tasks - `--resume`: used as a checkpoint, allowing jobs to resume - `--progress`: display job progress - `--eta`: estimated time to finish - `--load`: threshold for CPU load, e.g. 80% - `--noswap`: new jobs won't be started if a node is under heavy memory load - `--memfree`: check if there is enough free memory, e.g. 2G - `--dry-run`: display the commands to run without executing them Note: `--joblog logfile` pairs with the `--resume` option for production runs. A unique logfile name is recommended, such as `$SLURM_JOB_NAME.log`; otherwise, a job rerun will not start when the same logfile already exists. 2. Serial bash script: ``` #!/bin/bash module load bio/blast-plus parallel blastp -query $1 -db ../blast/db/img_v400_PROT.00 -out $2 -outfmt 7 -max_target_seqs 10 -num_threads $3 ``` where $1, $2, and $3 are the three parameters required for each serial task. 3. Task list: a list of parameters for the tasks, one line per task. The parameters required for each task must be on the same line, separated by a delimiter: ``` [user@n0002 ~] cat task.lst ../blast/data/protein1.faa ../blast/data/protein2.faa ``` In this example, although each task takes three parameters (run-blast.sh), only one parameter is provided in the task list task.lst. The second parameter, which specifies the output, corresponds to output/{/.}.blst in blast.slurm, and the third parameter, num_threads, is fixed. However, if the core count varies from task to task, task.lst could be revised as: ``` [user@n0002 ~] cat task.lst ../blast/data/protein1.faa 2 ../blast/data/protein2.faa 4 ``` As a best practice, test your code on an interactive node before submitting jobs to the cluster. In addition, the task list can be a sequence of commands, such as: ``` [user@n0002 ~] cat commands.lst echo "host = " '`hostname`' sh -c "echo today date = ; date" sh -c "echo today date = ; date" ``` ``` [user@n0002 ~] parallel -j 2 < commands.lst host = n0148.savio3 today date = Sat Apr 18 14:07:33 PDT 2020 ``` ### Useful external links - [GNU Parallel man page](https://www.gnu.org/software/parallel/man.html) - [GNU Parallel tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html) ## Monitoring the status of running batch jobs To monitor a running job, you need to know the SLURM job ID of that job, which can be obtained by running ``` squeue -u $USER ``` ## Monitoring the job from a login node If you suspect your job is not running properly, or you simply want to understand how much memory or how much CPU the job is actually using on the compute nodes, the cluster provides a script `wwall` to check that. The following provides a snapshot of the status of the node(s) that the job is running on: ``` wwall -j $your_job_id ``` while ``` wwall -j $your_job_id -t ``` provides a text-based user interface (TUI) to monitor the node status as the job progresses. To exit the TUI, enter "q" to quit out of the interface and be returned to the command line. You can also see a "top"-like summary for all nodes by running `wwtop` from a login node. You can use the page up and down keys to scroll through the nodes to find the node(s) your job is using. 
All CPU percentages are relative to the total number of cores on the node, so 100% usage would mean that all of the cores are being fully used. ## Monitoring the job by logging into the compute node Alternatively, you can log in to the node your job is running on as follows:
```
srun --jobid=$your_job_id --pty /bin/bash
```
This runs a shell in the context of your existing job. Once on the node, you can run top, htop, ps, or other tools. If you're running a multi-node job, the command above will get you onto the first node, from which you can ssh to the other nodes if desired. You can determine the other nodes based on the `SLURM_NODELIST` environment variable. ## Checking finished jobs First of all, you should look for the SLURM output and error files that may be created in the directory from which you submitted the job. Unless you have specified your own names for these files, they will be named `slurm-<jobid>.out` and `slurm-<jobid>.err`. After a job has completed (or been terminated/cancelled), you can review the maximum memory used via the sacct command.
```
sacct -j $your_job_id --format=JobID,JobName,MaxRSS,Elapsed
```
MaxRSS will show the maximum amount of memory that the job used in kilobytes. You can check all the jobs that you ran within a time window as follows:
```
sacct -u $USER --starttime=2019-09-27 --endtime=2019-10-04 \
--format JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,NodeList
```
Please see `man sacct` for a list of the output columns you can request, as well as the SLURM documentation for the [`sacct`](https://slurm.schedmd.com/sacct.html) command. Here we show some example job scripts that allow for various kinds of parallelization such as: jobs that use fewer cores than available on a node, GPU jobs, low-priority condo jobs, and long-running PCA jobs. Please refer to [Slurm Association](../slurm-overview/#slurm-association) on how to use the command `sacctmgr` to obtain details of the accounts, partitions, and quality of service (QoS) that are needed in a slurm script.
## Example Set 1 ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --account=account_name #SBATCH --time=0:0:30 ## Run command ./a.out ``` ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=1 #SBATCH --ntasks-per-node=20 #SBATCH --cpus-per-task=1 # # Wall clock limit: #SBATCH --time=00:00:30 # ## Command(s) to run (example): ./a.out ``` ``` #!/bin/bash #SBATCH --job-name=job-name #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=2 #SBATCH --cpus-per-task=2 #SBATCH --time=2:00:00 # ## Command(s) to run (example): module load bio/blast/2.6.0 module load gnu-parallel/2019.03.22 export WDIR=/your/desired/path cd $WDIR # set number of jobs based on number of cores available and number of threads per job export JOBS_PER_NODE=$(( $SLURM_CPUS_ON_NODE / $SLURM_CPUS_PER_TASK )) # echo $SLURM_JOB_NODELIST |sed s/\,/\\n/g > hostfile # parallel --jobs $JOBS_PER_NODE --slf hostfile --wd $WDIR --joblog task.log --resume --progress -a task.lst sh run-blast.sh {} output/{/.}.blst $SLURM_CPUS_PER_TASK ``` ## Example Set 2 ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=4 # #SBATCH --time=00:00:30 ## Command(s) to run (example): export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./a.out ``` ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --ntasks=40 (1) # # Processors per task: #SBATCH --cpus-per-task=1 #SBATCH --time=00:00:30 # ## Command(s) to run (example): module load gcc openmpi mpirun ./a.out ``` 1. Number of MPI tasks needed ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=2 #SBATCH --ntasks-per-node=20 #SBATCH --cpus-per-task=1 #SBATCH --time=00:00:30 # ## Command(s) to run (example): module load gcc openmpi mpirun ./a.out ``` ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=partition_name #SBATCH --qos=qos_name #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 #SBATCH --cpus-per-task=5 (1) #SBATCH --time=00:00:30 (2) # ## Command(s) to run (example): module load gcc openmpi export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpirun ./a.out ``` 1. Processors per task needed 1. Wall clock limit ## GPU Job `es1` partition consists of GPU nodes with three generations of NVIDIA GPU cards (V100, GTX 2080TI, A40). Please take a look at the details on this [page](https://it.lbl.gov/resource/hpc/lawrencium/). A compute node with different GPU types and numbers can be allocated using slurm in the following way. - General format: `--gres=gpu[type]:count` - The above format can schedule jobs on nodes with V100, GTX 2080TI, or A40 GPU cards. To specify a particular card: - GRTX2080TI: `--gres=gpu:GRTX2080TI:1` (up to 3 or 4 GPUs) - V100 : `--gres=gpu:V100:1`(up to 2 GPUs) - A40: `--gres=gpu:A40:1` (up to 4 GPUs) - H100: `--gres=gpu:H100:1` (up to 8 GPUs) To help the job scheduler effectively manage the use of GPUs, your job submission script must request multiple CPUs (usually two) for each GPU you use. The scheduler will reject jobs submitted that do not request sufficient CPUs for every GPU. This ratio should be one:two. 
Here’s how to request two CPUs for each GPU: the total of CPUs requested results from multiplying two settings: the number of tasks (`--ntasks=`) and CPUs per task (`--cpus-per-task=`). For instance, in the above example, one GPU was requested via `--gres=gpu:1`, and the required total of two CPUs was thus requested via the combination of `--ntasks=1` and --cpus-per-task=2 . Similarly, if your job script requests four GPUs via `--gres=gpu:4`, and uses `--ntasks=8`, it should also include `--cpus-per-task=1` to request the required total of eight CPUs. Note that in the `--gres=gpu:n` specification, `n` must be between 1 and the number of GPUs on a single node (which is provided [here for the various GPU types](../../systems/einsteinium/)). This is because the feature is associated with how many GPUs per node to request. Examples: - Request one V100 card: `--cpus-per-task=4 --gres=gpu:V100:1 --ntasks 1` - Request two A40 cards: `--cpus-per-task=16 --gres=gpu:A40:2 --ntasks 2` - Request three H100 cards: `--cpus-per-task=14 --gres=gpu:H100:3 --ntasks 3` ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=es1 #SBATCH --qos=es_normal #SBATCH --nodes=1 #SBATCH --ntasks=1 # # Processors per task (please always specify the total number of processors twice the number of GPUs): #SBATCH --cpus-per-task=2 # #Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:V100:[1-4] with the type included #SBATCH --gres=gpu:1 # # Wall clock limit: #SBATCH --time=1:00:00 # ## Command(s) to run (example): ./a.out ``` SLURM is a resource manager and job scheduling system developed by SchedMD. The trackable resources (TRES) include Nodes, CPUs, Memory and Generic Resources (GRES). Slurm has three key functions: 1. Allocate resources exclusively/non-exclusive to nodes, 1. start/execute and monitor the resources on a node, and 1. arbitrate pending and queued work. Nodes are grouped together within a partition. The partitions can also be considered as job queues and each of which has a set of constraints such as job size limit, time limit, default memory limits, and the number of nodes, etc. Submitting a job to the system requires you to specify a partition, an account, a Quality of Service (QoS), the number of nodes, wallclock time limits and optional memory (default will be used if not specified). Jobs within a partition will then be allocated to nodes based on the scheduling policy, until all resources within a partition are exhausted. There are several basic commands you will need to know to submit jobs, cancel jobs, and check status. These are: - `sbatch` – submit a job to the batch queue system, e.g., `sbatch myjob.sh` - `squeue` – check the current jobs in the batch queue system, e.g., `squeue` By default, `squeue` command will list all the jobs in the queue system. To list only the jobs pertaining to your username $USER, use the command: ``` squeue -u $USER ``` - `sinfo` – view the current status of the queues, e.g., `sinfo` - `scancel` – cancel a job, e.g., `scancel 1234567` - `srun` – to run interactive jobs Interactive job using `srun` You can use `srun` to request and run an interactive job. The following example (please change the `account_name` and the `partition_name` according to your needs) requests a `lr4` node for 1 hour. 
```
srun -p lr4 -A account_name -q lr_normal -N 1 -t 1:00:00 --pty bash
```
The prompt will change to indicate that you are on the compute node allocated for the interactive job once the interactive job starts:
```
srun: job 7566529 queued and waiting for resources
srun: job 7566529 has been allocated resources
[user@n0105 ~]$
```
If you are done working on the interactive job before the allocated time, you can release the resources by typing `exit` on the interactive node. ## Slurm Association A Slurm job submission script includes a list of SLURM directives (or commands) to tell the job scheduler what to do. This information, such as the user account, cluster partition and QoS (Quality of Service), has to be paired correctly in your job submission scripts. The Slurm command `sacctmgr` provides the accounts, partitions and QoSes that are available to you as a user.
```
sacctmgr show association -p user=$USER
```
The command returns output like the following for a hypothetical example user `userA`. To be specific, `userA` has access to a PI Computing Allowance `pc_acctB`, the departmental cluster nano, and the condo account `lr_acctA` with respect to different partitions. Each line of this output indicates a specific combination of an account, a partition, and QoSes that you can use in a job script file when submitting any individual batch job:
```
Cluster|Account|User|Partition|Share|…|QOS|…
perceus-00|pc_acctB|userA|ood_inter|1|||||||||||||lr_interactive|||
perceus-00|pc_acctB|userA|cm1|1|||||||||||||cm1_debug,cm1_normal|||
perceus-00|pc_acctB|userA|lr6|1|||||||||||||lr_debug,lr_normal|||
perceus-00|pc_acctB|userA|lr_bigmem|1|||||||||||||lr_normal|||
perceus-00|pc_acctB|userA|lr5|1|||||||||||||lr_debug,lr_normal|||
perceus-00|pc_acctB|userA|lr4|1|||||||||||||lr_debug,lr_normal|||
perceus-00|pc_acctB|userA|lr3|1|||||||||||||lr_debug,lr_normal|||
perceus-00|pc_acctB|userA|cf1|1|||||||||||||cf_debug,cf_normal|||
perceus-00|pc_acctB|userA|es1|1|||||||||||||es_debug,es_lowprio,es_normal|||
```
The Account, Partition, and QOS columns indicate which partitions and QoSes you have access to under each of your account(s). The Software Module Farm (SMF) is managed by the [Lmod](https://lmod.readthedocs.io/en/latest/index.html) Environment Module system to set the appropriate environment variables in your shell needed to make use of the individual software packages. Environment Modules are used to manage users' runtime environments dynamically. This is accomplished by loading and unloading modulefiles which contain the application-specific information for setting a user's environment, primarily the shell environment variables, such as `PATH`, `LD_LIBRARY_PATH`, etc. Modules are useful in managing different applications, and different versions of the same application, in a cluster environment.
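As a quick illustration of what loading a modulefile does to your shell environment, you can inspect `PATH` before and after a `module load` (the exact paths printed will differ on the system; this is only a sketch):
```
# Inspect how a module load modifies the shell environment (output is illustrative)
echo $PATH | tr ':' '\n' | head -2
module load gcc/11.4.0
echo $PATH | tr ':' '\n' | head -2   # the module's bin directory is now prepended
which gcc                            # resolves to the module's gcc rather than the system compiler
```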
The following commands are some frequently useful commands to manipulate modules in your environment: ``` module load SOFTWARE # Load the module “SOFTWARE” module unload SOFTWARE # Unload the module “SOFTWARE” module available # List all modules available for loading module list # List all modules currently loaded ``` ## Finding Modules module spider `module spider SOFTWARE` is a useful module command that lists the module(s) named `SOFTWARE` and information on additional modules that you may need to load before `SOFTWARE` is available to load For example: ``` [user@n0000 ~]$ module spider hdf5 ------------------------------------------------------------------------- hdf5: hdf5/1.14.3 ------------------------------------------------------------------------- You will need to load all module(s) on any one of the lines below before the "hdf5/1.14.3" module is available to load. gcc/10.5.0 openmpi/4.1.3 gcc/10.5.0 openmpi/4.1.6 gcc/11.4.0 openmpi/4.1.3 gcc/11.4.0 openmpi/4.1.6 intel-oneapi-compilers/2023.1.0 intel-oneapi-mpi/2021.10.0 Help: HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. ``` This means that you will need to load the appropriate compiler + mpi library combination before being able to load the corresponding hdf5 module. For example, you can do the following: ``` module load gcc/11.4.0 module load openmpi/4.1.6 module load hdf5 ``` ## Environment Modules Usage Examples There are some basic commands that users will need to know to work with the Environment Modules system, which all starts with the primary “module” command, and followed by a subcommand listed below (“|” means “or”, e.g., “module add” and “module load” are equivalent). For detail usage instruction of the “module” command please run “man module”. - `module avail` – List all available modulefiles in the current `MODULEPATH`. - `module list` – List loaded modules. - `module add|load modulefile …` – Load modulefile(s) into the shell environment. - `module rm|unload modulefile` … – Remove modulefile(s) from the shell environment. - `module swap|switch [modulefile1] modulefile2` – Switch loaded modulefile1 with modulefile2. - `module show|display modulefile …` – Display information about one or more modulefiles. - `module whatis [modulefile …]` – Display the information about the modulefile(s). - `module purge` – Unload all loaded modulefiles. Below we demonstrate how to use these commands. Depending on which system you have access to and when you are reading this instruction, what you see here could be different from the actual output from the system that you work on. 
module avail ``` [user@n0000.scs00 ~]$ module avail ------------------------- /global/software/rocky-8.x86_64/modfiles/compilers --------------------- gcc/10.5.0 gcc/11.4.0 (D) intel-oneapi-compilers/2023.1.0 llvm/17.0.4 nvhpc/23.9 ------------------------- /global/software/rocky-8.x86_64/modfiles/tools ------------------------- automake/1.16.5 ffmpeg/6.0 lmdb/0.9.31 proj/9.2.1 tcl/8.6.12 awscli/1.29.41 gdal/3.7.3 m4/1.4.19 protobuf/3.24.3 tmux/3.3a bazel/6.1.1 glog/0.6.0 matlab/r2022a qt/5.15.11 unixodbc/2.3.4 cmake/3.27.7 gmake/4.4.1 mercurial/6.4.5 rclone/1.63.1 valgrind/3.20.0 code-server/4.12.0 gurobi/10.0.0 nano/7.2 snappy/1.1.10 vim/9.0.0045 eigen/3.4.0 imagemagick/7.1.1-11 ninja/1.11.1 spack/0.20.1 emacs/29.1 leveldb/1.23 parallel/20220522 swig/4.1.1 ------------------------- /global/software/rocky-8.x86_64/modfiles/langs ------------------------- anaconda3/2024.02-1-11.4 openjdk/11.0.20.1_1-gcc-11.4.0 r/4.3.0-gcc-11.4.0 ``` module list ``` [user@n0000 ~]$ module list No modules loaded ``` module load ``` [user@n0000 ~]$ module load gcc [user@n0000 ~]$ module load openmpi [user@n0000 ~]$ module list Currently Loaded Modules: 1) gcc/11.4.0 2) ucx/1.14.1 3) openmpi/4.1.6 ``` On systems in which a hierarchical structure is used, some of modulefiles will only be available after the root modulefile is loaded. The Lawrencium cluster uses a hierarchical structure for several packages that depend on a particular compiler and/or MPI package. For example, after loading `gcc/11.4.0` and `openmpi/4.1.6` in the example above, `module avail` will show new packages that can now be loaded: hierarchical structure ``` [user@n0000 ~]$ module avail ------------- /global/software/rocky-8.x86_64/modfiles/openmpi/4.1.6-4xq5u5r/gcc/11.4.0 ------------ boost/1.83.0 hmmer/3.4 ncl/6.6.2 netcdf-fortran/4.6.1 fftw/3.3.10 intel-oneapi-mkl/2023.2.0 (D) nco/5.1.6 netlib-lapack/3.11.0 (D) gromacs/2023.3 lammps/20230802 ncview/2.1.9 netlib-scalapack/2.2.0 hdf5/1.14.3 mumps/5.5.1 netcdf-c/4.9.2 petsc/3.20.1 ------------------------ /global/software/rocky-8.x86_64/modfiles/gcc/11.4.0 ----------------------- antlr/2.7.7 gsl/2.7.1 openmpi/4.1.6 (L,D) blast-plus/2.14.1 idba/1.1.3 picard/2.25.7 bowtie2/2.5.1 intel-oneapi-mkl/2023.2.0 prodigal/2.6.3 cuda/11.8.0 intel-oneapi-tbb/2021.10.0 samtools/1.17 cuda/12.2.1 (D) netlib-lapack/3.11.0 ucx/1.14.1 (L) cudnn/8.7.0.84-11.8 openblas/0.3.24 udunits/2.2.28 cudnn/8.9.0-12.2.1 (D) openmpi/4.1.3 vcftools/0.1.16 ``` For example, now `gromacs` can be loaded. Note ``` [sadhikari@n0000 ~]$ module load gromacs [sadhikari@n0000 ~]$ module list Currently Loaded Modules: 1) gcc/11.4.0 3) openmpi/4.1.6 5) intel-oneapi-tbb/2021.10.0 7) netlib-lapack/3.11.0 2) ucx/1.14.1 4) fftw/3.3.10 6) intel-oneapi-mkl/2023.2.0 8) gromacs/2023.3 ``` `module show` command displays information about the module. module show ``` [user@n0000 ~]$ module show fftw --------------------------------------------------------------------------------------------------------- /global/software/rocky-8.x86_64/modfiles/openmpi/4.1.6-4xq5u5r/gcc/11.4.0/fftw/3.3.10.lua: --------------------------------------------------------------------------------------------------------- whatis("Name : fftw") whatis("Version : 3.3.10") whatis("Target : x86_64") whatis("Short description : FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/od d data, i.e. the discrete cosine/sine transforms or DCT/DST). 
We believe that FFTW, which is free software, should become the FFT library of choice for most applications.")
help([[Name : fftw]])
help([[Version: 3.3.10]])
help([[Target : x86_64]])
help([[FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST). We believe that FFTW, which is free software, should become the FFT library of choice for most applications.]])
depends_on("openmpi/4.1.6")
prepend_path("PATH","/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/bin")
prepend_path("MANPATH","/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/share/man")
prepend_path("PKG_CONFIG_PATH","/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/.")
append_path("MANPATH","")
```
The `module purge` command unloads all currently loaded modulefiles. module purge
```
[user@n0000 ~]$ module purge
[user@n0000 ~]$ module list
No modules loaded
```
## User Generated Modulefiles User generated modulefiles Users can generate their own modulefiles to load user-specific applications (a minimal end-to-end example of creating and loading such a modulefile is sketched at the end of this section). The path to the modulefiles needs to be appended to the `MODULEPATH` environment variable as follows: 1). For bash users, please add the following to ~/.bashrc:
```
export MODULEPATH=$MODULEPATH:/location/to/my/modulefiles
```
2). For csh/tcsh users, please add the following to ~/.cshrc:
```
setenv MODULEPATH "$MODULEPATH":/location/to/my/modulefiles
```
Software Module Farm (SMF) Service For an overview of the Software Module Farm (SMF) as a service on non-Lawrencium systems, please follow [this link](https://it.lbl.gov/service/scienceit/high-performance-computing/scientific-cluster-services/software-module-farm/) . On this page, we focus on the Software Module Farm (SMF) software packages and their usage on the Lawrencium cluster. The Software Module Farm provides a comprehensive and well-tested suite of software modules for Lawrencium users. Several types of software modules are available: 1. **Tools**: Tool modules are built and compiled with the default system `gcc` compiler. They have no other dependencies. For the current operating system, the `gcc` system compiler is `gcc 8.5.0`. 1. **Compilers**: Other common compilers and newer versions of `gcc`; for example: `gcc 11.4.0`. Many applications and libraries not found in the **Tools** category are built with these compilers and can be accessed after loading the corresponding compiler. 1. **Languages**: Language modules include additional compilers and interpreters for specific languages such as `python`, `R` and `julia`. 1. **Applications**: Domain-specific applications such as biology and machine learning packages. 1. **Submodules**: Submodules include libraries and packages which depend on a particular compiler or language module. Due to this dependency, submodules will only be visible once the associated language or core compiler module has been loaded. For example, the `hdf5` submodule is only visible once you load the `gcc` and `openmpi` modules.
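Building on the user-generated modulefiles described above, here is a minimal sketch of creating a personal modulefile for a locally installed tool. The directory layout, tool name, and install paths are hypothetical; adapt them to your own installation:
```
# Hypothetical example: a tool installed under ~/software/mytool-1.0
mkdir -p ~/modulefiles/mytool
cat > ~/modulefiles/mytool/1.0.lua <<'EOF'
-- Minimal Lmod (Lua) modulefile for a user-installed package (hypothetical paths)
whatis("Name : mytool")
whatis("Version : 1.0")
prepend_path("PATH", pathJoin(os.getenv("HOME"), "software/mytool-1.0/bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(os.getenv("HOME"), "software/mytool-1.0/lib"))
EOF

# Make the personal modulefile tree visible to Lmod (as described above),
# then load the new module:
export MODULEPATH=$MODULEPATH:$HOME/modulefiles
module load mytool/1.0
```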
See the [Module Management](../module-management/) page for details on how to use the `module` command for module management on Lawrencium. ## Software installation by Users Users are encouraged to install domain scientific software packages or local software module farms in their home or group space. Users don’t have admin rights, but most software can be installed with the flag `--prefix=/dir/to/your/path`. # VASP on Lawrencium The Vienna Ab initio Simulation Package ([VASP](https://www.vasp.at/)) is a suite for quantum-mechanical molecular dynamics (MD) simulations and electronic structure calculations. VASP is a licensed package and the license is sold on a research group basis. HPCS group has compiled a VASP 6.4.1 version of the package on Lawrencium. License holder users or group of users can get access to package on request. New licensees need to complete the [VASP: Access Request form](https://docs.google.com/forms/d/e/1FAIpQLSe9dO-dcdcsVqqhiYv4TDhxtjmezjzxs9GvOfF9_C3Lje-E5A/viewform?usp=dialog) to be added to the linux groups authorized to use VASP. Please, provide the proof of purchase with this request. Please feel free to reach out to us at hpcshelp@lbl.gov if you would like us to update the version of the package. VASP binaries provided on Lawrencium are compiled targeting CPU or GPU partitions. Following guidelines can help users to run vasp calculation. ## VASP CPU Binary (Intel Compiler) The `vasp/6.4.1-cpu-intel` module is compiled using the intel compiler and mpi modules. To load the module: ``` module load intel-oneapi-compilers/2023.1.0 module load intel-oneapi-mpi/2021.10.0 module load vasp/6.4.1-cpu-intel ``` Sample VASP CPU slurm script Please modify the `account`, `qos`, `ntasks`, `time` and other variables in the sample job scripts below appropriately before running your job. ``` #!/bin/bash #SBATCH --job-name="check" #SBATCH --ntasks=14 #SBATCH --cpus-per-task=4 #SBATCH --output=%x.out #SBATCH --error=%x.err #SBATCH --time=1:00:00 #SBATCH --partition=lr7 #SBATCH --account= #SBATCH --qos=lr_normal module load intel-oneapi-compilers/2023.1.0 module load intel-oneapi-mpi/2021.10.0 module load vasp/6.4.1-cpu-intel export OMP_NUM_THREADS=4 srun --mpi=pmi2 vasp_std ``` ## VASP GPU Binary (NVHPC SDK) The `vasp/6.4.1-gpu` module is compiled using NVHPC SDK. To load the module: ``` module load nvhpc/23.11 module load vasp/6.4.1-gpu ``` Sample VASP GPU slurm script The following sample script runs VASP on a `H100 es1` node using 2 `H100` GPUs. ``` #!/bin/bash #SBATCH --job-name="rfm_VASPCheck_1401ba9b" #SBATCH --ntasks=2 #SBATCH --cpus-per-task=14 #SBATCH --output=vasp_job.out #SBATCH --error=vasp_job.err #SBATCH --time=1:00:00 #SBATCH --partition=es1 #SBATCH --account= #SBATCH --qos=es_normal #SBATCH --gres=gpu:H100:2 module load nvhpc/23.11 module load vasp/6.4.1-gpu export PMIX_MCA_psec=native export OMP_NUM_THREADS=1 mpirun -np 2 vasp_std ``` ## Compiling VASP Users can also compile the package on their own in their home or group space. Please reach out to us if you need help setting up makefile for GNU, intel or nvhpc compilers. # GCC Compilers on Lawrencium Several `gcc` compiler versions are available on Lawrencium. The default `gcc` compiler is `gcc/11.4.0` available through `module load gcc`. Two other `gcc` versions are available: `gcc/10.5.0` and `gcc/13.2.0`. 
To load a `gcc` module other than the default, specify the version; for example:
```
module load gcc/13.2.0
```
The C, C++ and Fortran compilers in the `gcc` compiler suite are: - C: `gcc` - C++: `g++` - Fortran: `gfortran` ## Additional References - [GCC 13.2.0 Manual](https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/) - [GCC 11.4.0 Manual](https://gcc.gnu.org/onlinedocs/gcc-11.4.0/gcc/) - [GCC 10.5.0 Manual](https://gcc.gnu.org/onlinedocs/gcc-10.5.0/gcc/) - [GCC Online Documentation](https://gcc.gnu.org/onlinedocs/) # Intel OneAPI Compilers on Lawrencium `intel-oneapi-compilers` version `2023.1.0` is available on Lawrencium; it consists of both the new LLVM-based oneAPI compilers `icx`, `icpx`, `ifx` and the Intel classic compilers `icc`, `icpc`, `ifort`. The default `intel-oneapi-compilers` module can be loaded as:
```
module load intel-oneapi-compilers
```
## LLVM-based oneAPI Compilers The version of the LLVM-based oneAPI compilers `icx`, `icpx`, `ifx` follows the version of the oneAPI package. Some relevant reference pages on the Intel documentation website for the `2023.1.0` version of the oneAPI compilers installed on Lawrencium are listed below: - [Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference](https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-1/overview.html) - [Intel Fortran Compiler Classic and Intel Fortran Compiler Developer Guide and Reference](https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/overview.html) ## Intel Classic Compilers Version scheme of Intel Classic Compilers The version of the Intel classic compilers in the module `intel-oneapi-compilers/2023.1.0` differs from `2023.1.0`: the version of the `ifort`, `icc`, `icpc` compilers in the module `intel-oneapi-compilers/2023.1.0` is `2021.9.0`. Some relevant reference pages on the Intel documentation website for the `2021.9.0` version of the Intel classic compilers installed on Lawrencium are listed below: - [Intel C++ Compiler Classic Developer Guide and Reference](https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-9/overview.html) - [Intel Fortran Compiler Classic and Intel Fortran Compiler Developer Guide and Reference](https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/overview.html) ## Additional References - [Porting Guide for ICC users to DPCPP or ICX](https://www.intel.com/content/www/us/en/developer/articles/guide/porting-guide-for-icc-users-to-dpcpp-or-icx.html) - [Porting Guide for IFORT to IFX](https://www.intel.com/content/www/us/en/developer/articles/guide/porting-guide-for-ifort-to-ifx.html) # NVHPC: NVIDIA HPC SDK The NVIDIA HPC SDK version 23.11 is available on Lawrencium. You can load the `nvhpc` module as:
```
module load nvhpc/23.11
```
The `nvhpc` module consists of the following compilers: - C: `nvc` - C++: `nvc++` - Fortran: `nvfortran` The CUDA C and CUDA C++ compiler driver [`nvcc`](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) is also present in the module. ## CUDA Versions `nvhpc/23.11` includes two CUDA toolkit versions: `CUDA 11.8` and `CUDA 12.3`. You can choose a particular version by using the compiler flag `-gpu`; for example, use `-gpu=cuda11.8` to choose `CUDA 11.8` when compiling a program using `nvhpc` compilers. ## Target Architecture The `-tp=` flag can be used to specify a target processor when compiling using `nvhpc` compilers.
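For example, a compile line that sets the target processor explicitly might look like the following sketch (the `-tp` values and source file names are illustrative; consult the NVIDIA HPC SDK documentation for the full list of supported `-tp` targets):
```
module load nvhpc/23.11
# Target the CPU of the node you are compiling on:
nvc -O2 -tp=native -o mycode mycode.c
# Or name a specific architecture, e.g. for AMD EPYC (Zen 3) compute nodes:
nvfortran -O2 -tp=zen3 -o mycode mycode.f90
```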
-fast `-fast` compiler option is useful to choose an optimal set of vectorization options, but can lead to an auto-selection of the `-tp` option. This might mean that your compiled code may not work on all Lawrencium partitions as Lawrencium includes a wide range of hardware across several partitions. ## MPI `nvhpc` module includes a version of openmpi. MPI wrapper programs such as `mpicc`, `mpicxx` and `mpifort` are available once `nvhpc` is loaded. Running MPI programs To run MPI jobs compiled with the `nvhpc` module on Lawrencium, use `mpirun` instead of `srun`. Special considerations For some GPU nodes with AMD CPU hosts such as `A40` nodes, we have found that the following environment variable needs to be set when using `nvhpc`'s `mpirun` command to launch MPI jobs: ``` export PMIX_MCA_psec=native ``` In addition, `--bind-to core` which is the default for `mpirun` might not work; in which case, you can try `--bind-to none` or `--bind-to socket`. For example: ``` mpirun -np 2 --bind-to socket ./program ``` ## Additional References - [NVIDIA HPC SDK Version 23.11 Documentation](https://docs.nvidia.com/hpc-sdk/archive/23.11/index.html) - [NVIDIA HPC SDK Version 23.11 Release Notes](https://docs.nvidia.com/hpc-sdk/archive/23.11/hpc-sdk-release-notes/index.html) # Using R on Lawrencium Version `4.4.0` of R is available to users; the R module can be loaded as: ``` module load r ``` Some commonly used r-packages are already installed with the R module available on the system. To view the list of packages already installed, use the following command in the R command prompt (either in your terminal or RStudio session on Open OnDemand): ``` installed.packages() ``` Another module `r-spatial` is available for a standard set of R packages for spatial data: ``` module load r-spatial ``` # Using Julia on Lawrencium Julia is available on Lawrencium as a module. ``` $ module av julia -------------- /global/software/rocky-8.x86_64/modfiles/langs --------------- julia/1.10.2-11.4 ``` ``` $ module load julia $ julia _ _ _ _(_)_ | Documentation: https://docs.julialang.org (_) | (_) (_) | _ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help. | | | | | | |/ _` | | | | |_| | | | (_| | | Version 1.10.2 (2024-03-01) _/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release |__/ | julia> ``` ## Using Julia through Jupyter on Open OnDemand A Julia kernel has been added to Jupyter on Open OnDemand that allows you to use Julia on the Jupyter server on Open OnDemand. # Using Python on Lawrencium ## Python packages Python 2 The `rocky-8` operating system in Lawrencium has installation of `Python 3.6` and `Python 2.7`. To use these, use the command `python3` and `python2` respectively without loading other python modules. Several Python modules are available on the Lawrencium software module farm. There are two basic (with only a few additional site-packages) python modules provided. To list these python modules: ``` $ module av python ---------- /global/software/rocky-8.x86_64/modfiles/langs ---------- python/3.10.12-gcc-11.4.0 python/3.11.6-gcc-11.4.0 (D) $ module load python/3.10.12 $ python Python 3.10.12 (main, Mar 22 2024, 00:44:12) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> ``` To load one of these modules use: `module load python/3.10.12` or `module load python/3.11.6`. Additional site-packages installed in these python modules are: `numpy`, `scipy`, `matplotlib`, `mpi4py`, `h5py`,`netCDF4`, `pandas`, `geopandas`, `ipython` and `pyproj`. 
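A quick way to confirm which of these site-packages (and which versions) a loaded python module provides is to import them directly; for example (the version numbers printed will vary):
```
module load python/3.11.6
python -c "import numpy, scipy, pandas; print(numpy.__version__, scipy.__version__, pandas.__version__)"
```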
User installation of python packages You can use pip to install or upgrade packages. ``` python -m pip install --user $PACKAGENAME ``` to install a python package to `~/.local` directory. The package libraries are usually installed in a sub-directory for each python version; for example: `~/.local/lib/python3.10/site-packages/`. Choosing python modules Please note that the linear algebra backend for `numpy` in these two python modules (`python/3.11.6` and `python/3.10.12`) is the openBLAS library whereas the Anaconda distributions (`anaconda3/2024.02` and `anaconda3/2024.10`) use the Intel MKL library. Some linear algebra operations can be faster using `numpy` through the `anaconda3` module. ## Anaconda environment We also provide `anaconda3/2024.02` and `anaconda3/2024.10` python environments that have many popular scientific and numerical python libraries pre-installed. Examples of loading anaconda3 ``` $ module load anaconda3/2024.02 $ python Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> ``` ``` $ module load anaconda3/2024.10 $ python Python 3.12.7 | packaged by Anaconda, Inc. | (main, Oct 4 2024, 13:27:36) [GCC 11.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. ``` Several Jupyter kernels are available to access `tensorflow` and `pytorch` conda environments from the [Jupyter server](../../../openondemand/jupyter-server/) on [Open OnDemand](../../../openondemand/overview/). [Click here](../../../openondemand/packages-kernels/) for more information on installing python packages and jupyter kernels for use on the Jupyter server on Open OnDemand. ## Intel Distribution of Python Additionally the [Intel Distribution of Python (Python 3.9)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.c1qvsx) is available, and can be loaded as: ``` module load intelpython/3.9.19 ``` When you load `intelpython`, `intel-oneapi-compilers` and `intel-oneapi-mpi` are also loaded because we have added `mpi4py` package linked to Intel MPI library to the Intel Distribution of Python. ## Using Dask [Dask](https://www.dask.org/) is available both in the `anaconda3` and `intelpython` modules. Dask can be useful when you are working with large datasets that don't fit in the memory of a single machine. Dask implements lazy evaluation, task scheduling and data chunking that makes it useful when performing analysis on large datasets. Dask JupyterLab Extension Dask JupyterLab Extension can be used to manage Dask clusters and monitor it through various dashboard plots in JupyterLab panes. To install dask-labextension once you have a python module loaded: ``` python -m pip install dask-labextension ``` # CUDA Toolkit The NVIDIA CUDA Toolkit, consisting of Nvidia-GPU-accelerated libraries, C/C++ compiler and various related tools, is available under `gcc` compiler tree. `cuda/11.8.0` and `cuda/12.2.1` are available after loading a `gcc` module. For example: ``` module load gcc/11.4.0 module load cuda/12.2.1 ``` loads CUDA Toolkit version 12.2.1. The environment variable `CUDA_HOME` is set by the `cuda` module. 
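As a quick sanity check that the toolkit is available after loading the modules (the `.cu` file name here is just a placeholder for your own source file):
```
module load gcc/11.4.0 cuda/12.2.1
echo $CUDA_HOME                      # install prefix set by the cuda module
nvcc --version                       # should report the 12.2 toolkit
nvcc -O2 -o my_kernel my_kernel.cu   # compile a CUDA C++ source file
```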
## Additional References - [CUDA Toolkit Information](https://developer.nvidia.com/cuda-toolkit) - [CUDA Toolkit 12.2.1 Documentation](https://docs.nvidia.com/cuda/archive/12.2.1/) - [CUDA Toolkit 11.8.0 Documentation](https://docs.nvidia.com/cuda/archive/11.8.0/) # FFTW ## Loading FFTW FFTW on Lawrencium can be loaded after loading an MPI library. For example, to load FFTW installed using the default gcc compiler and the default Open MPI on Lawrencium:
```
[user@n0000 ~]$ module load gcc openmpi
[user@n0000 ~]$ module avail fftw
--------- /global/software/rocky-8.x86_64/modfiles/openmpi/4.1.6-4xq5u5r/gcc/11.4.0 --------
fftw/3.3.10
```
```
[user@n0000 ~]$ module load fftw/3.3.10
```
## Compiling programs using FFTW library To compile using the loaded `fftw3` library, we need the appropriate `CFLAGS` and `LDFLAGS` during compilation and linking. These can be obtained on Lawrencium using
```
[user@n0000 ~]$ pkg-config --cflags --libs fftw3
-I/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/include -L/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/fftw-3.3.10-cf4npbktueip6tnwqf2qstog7on4pyfk/lib -lfftw3
```
Note that the result above does not include linker flags for MPI FFTW routines. To compile a program using `MPI FFTW`, in addition to `-lfftw3` we also need `-lfftw3_mpi` and `-lm` ([see here](https://www.fftw.org/fftw3_doc/Linking-and-Initializing-MPI-FFTW.html) ). Therefore, to compile using the MPI FFTW library:
```
mpicc -o output $(pkg-config --cflags --libs fftw3) -lfftw3_mpi -lm example_mpi_fftw.c
```
Compiling using `rpath` To compile using `rpath`, you need to add the following:
```
-Wl,-rpath,$(pkg-config --variable=libdir fftw3)
```
Compiling with `rpath` adds the `libdir` to the runtime library search path in the executable file. # HDF5 ## Loading HDF5 HDF5 on Lawrencium can be loaded after loading an MPI library. For example, to load HDF5 installed under the default gcc compiler and the default Open MPI on Lawrencium:
```
[user@n0000 ~]$ module load gcc openmpi
[user@n0000 ~]$ module avail hdf5
------------- /global/software/rocky-8.x86_64/modfiles/openmpi/4.1.6-4xq5u5r/gcc/11.4.0 --------------
hdf5/1.14.3
```
```
[user@n0000 ~]$ module load hdf5
```
## Compiling programs using HDF5 library Let's look at an example of compiling a simple C example `ph5_file_create.c` from [hdf5-examples](https://github.com/HDFGroup/hdf5-examples/tree/master/C/H5PAR) . The example creates an HDF5 file named `SDS_row.h5`. To compile using the loaded `HDF5` library, we need the appropriate `CFLAGS` and `LDFLAGS` during compilation and linking. These can be obtained on Lawrencium using
```
[user@n0000 ~]$ pkg-config --cflags --libs hdf5
-I/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/hdf5-1.14.3-6763puu3e5vxq4vmbaosgiv4yhzjb46s/include -L/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/hdf5-1.14.3-6763puu3e5vxq4vmbaosgiv4yhzjb46s/lib -lhdf5
```
To include these directly in the compilation process, we can do the following:
```
mpicc -o ph5_file_create $(pkg-config --cflags --libs hdf5) ph5_file_create.c
```
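A minimal sketch of running the resulting parallel HDF5 example inside a job allocation follows; the rank count is arbitrary, and the output file name comes from the example program described above:
```
# Run the compiled example on 4 MPI ranks within a job allocation
srun -n 4 ./ph5_file_create
ls SDS_row.h5   # the HDF5 file written by the example
```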
# Intel MKL Library The Intel MKL library is available under both `gcc` and `intel-oneapi-compilers` on Lawrencium. The Intel MKL library can be loaded after loading a compiler/MPI combination. For example:
```
[user@n0000 ~]$ module load gcc openmpi
[user@n0000 ~]$ module load intel-oneapi-mkl
[user@n0000 ~]$ module list
Currently Loaded Modules:
  1) gcc/11.4.0   2) ucx/1.14.1   3) openmpi/4.1.6   4) intel-oneapi-tbb/2021.10.0   5) intel-oneapi-mkl/2023.2.0
```
Similarly, we can load the MKL library with the Intel oneAPI compilers and MPI as:
```
[user@n0000 ~]$ module load intel-oneapi-compilers
[user@n0000 ~]$ module load intel-oneapi-mpi
[user@n0000 ~]$ module load intel-oneapi-mkl
[user@n0000 ~]$ module list
Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0   3) intel-oneapi-tbb/2021.10.0
  2) intel-oneapi-mpi/2021.10.0        4) intel-oneapi-mkl/2023.2.0
```
MKL Link Line Advisor Use the Intel® oneAPI Math Kernel Library [(oneMKL) Link Line Advisor tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html) to obtain the appropriate compiler and linker options depending on your use case. # NetCDF ## Loading netCDF NetCDF on Lawrencium can be loaded after loading an MPI library. For example, to load netCDF installed using the default gcc compiler and the default Open MPI on Lawrencium:
```
[user@n0000 ~]$ module load gcc openmpi
[user@n0000 ~]$ module avail netcdf
------------- /global/software/rocky-8.x86_64/modfiles/openmpi/4.1.6-4xq5u5r/gcc/11.4.0 --------------
netcdf-c/4.9.2    netcdf-fortran/4.6.1
```
As you can see in the output of `module avail netcdf`, a C version of the library (`netcdf-c`) and a Fortran version of the library (`netcdf-fortran`) are available. ## Compiling programs using netCDF library Let's look at an example of compiling a simple Fortran example `simple_xy_rd.f90` from [Example netCDF programs](https://www.unidata.ucar.edu/software/netcdf/examples/programs/) . The example reads a netCDF file containing a two-dimensional array of sample data. To compile using the `netcdf-fortran` library, we need the appropriate `CFLAGS` and `LDFLAGS` during compilation and linking. These can be obtained on Lawrencium using
```
[user@n0000 ~]$ pkg-config --cflags netcdf-fortran
-I/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/netcdf-fortran-4.6.1-fjshq66ynuoqqbtns2n3pwerlpymqjkg/include -I/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/netcdf-c-4.9.2-heo4zhdmupk4ru7x6aujkoptuceeilh2/include
[user@n0000 ~]$ pkg-config --libs netcdf-fortran
-L/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/netcdf-fortran-4.6.1-fjshq66ynuoqqbtns2n3pwerlpymqjkg/lib -lnetcdff
```
To include these directly in the compilation process, we can do the following:
```
gfortran -o simple_xy_rd $(pkg-config --cflags --libs netcdf-fortran) simple_xy_rd.f90
```
Before running the binary `simple_xy_rd`, you have to add the `netcdf-fortran` library path to the `LD_LIBRARY_PATH` environment variable.
```
export LD_LIBRARY_PATH=/global/software/rocky-8.x86_64/gcc/linux-rocky8-x86_64/gcc-11.4.0/netcdf-fortran-4.6.1-fjshq66ynuoqqbtns2n3pwerlpymqjkg/lib:$LD_LIBRARY_PATH
```
# AlphaFold 3 on Lawrencium [AlphaFold 3](https://github.com/google-deepmind/alphafold3.git) is a new AI model developed by [Google DeepMind](https://deepmind.google/) and [Isomorphic Labs](https://www.isomorphiclabs.com/) for generating 3D predictions of biological systems. The software package and the public database are now available on the Lawrencium cluster. ## Genetic Databases The genetic database required for AlphaFold 3 is saved under the shared directory `/clusterfs/collections/Alphafold3/public-db` on the cluster.
## Model Parameters The model parameters are the result of training the AlphaFold model and are required for the inference pipeline of AlphaFold 3. The model parameters are distributed separately from the source code by Google DeepMind and are subject to the [Model Parameters Terms of Use](https://github.com/google-deepmind/alphafold3/blob/v3.0.0/WEIGHTS_TERMS_OF_USE.md) . Lawrencium users interested in using AlphaFold 3 are required to abide by the above terms. Users can request a personal copy of the model parameters directly from Google DeepMind by filling out this [form](https://forms.gle/svvpY4u2jsHEwWYS6). If you have any questions about the fields of the form, you may send us an inquiry at [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). Once you receive a response and directions from Google DeepMind on obtaining the model parameters, you may save the parameters file in your home directory or project directory (if you are sharing with your group members) inside a directory named `model_param`. The parameters file is a single file approximately 1GB in size. ## Loading AlphaFold 3 module
```
module load ml/alphafold3
```
The `ml/alphafold3` module defines various environment variables such as `ALPHAFOLD_DIR` and `DB_DIR` that can be used to run a job as shown below. Users will have to set the environment variable `MODEL_PARAMETERS_DIR` before running the script, or it can be set directly in the job submission script as shown below. ## Running Note - Use `python /app/alphafold/run_alphafold.py` when using the `alphafold3.sif` image from the `ml/alphafold3` module. This is different from the official instructions on the `alphafold3` github page. See the sample slurm script below. Below is a sample script to run `alphafold3` after loading the `ml/alphafold3` module. It assumes the presence of `fold_input.json` in `$HOME/af_input` and saves output to `$HOME/af_output`. Please pay close attention to the different options, paths and variables and make changes as necessary.
```
#!/bin/bash
#SBATCH --account=
#SBATCH --partition=es1
#SBATCH --gres=gpu:H100:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --nodes=1
#SBATCH --qos=es_normal
#SBATCH --time=1:30:00

module load ml/alphafold3

# export the MODEL_PARAMETERS_DIR variable according to where the parameters are saved
# If model parameters are saved in your home directory:
export MODEL_PARAMETERS_DIR=/global/home/users/$USER/model_param
# If model parameters are saved in your group directory:
export MODEL_PARAMETERS_DIR=/global/home/groups//model_param

apptainer exec --nv --bind $HOME/af_input:/root/af_input \
    --bind $HOME/af_output:/root/af_output \
    --bind $MODEL_PARAMETERS_DIR:/root/models \
    --bind $DB_DIR:/root/public_databases \
    $ALPHAFOLD_DIR/alphafold3.sif \
    python /app/alphafold/run_alphafold.py \
    --json_path=/root/af_input/fold_input.json \
    --model_dir=/root/models \
    --db_dir=/root/public_databases \
    --output_dir=/root/af_output
```
# PyTorch ## Loading PyTorch
```
module load ml/pytorch
```
PyTorch versions Use `module spider pytorch` to get information on the versions of pytorch available as modules. `module load ml/pytorch` will additionally load other dependent modules such as `cuda`. If you use the jupyter server on [lrc-openondemand](../../../openondemand/overview/), pytorch kernels `torch 2.0.1` and `torch 2.3.1` are available.
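A quick way to confirm that the loaded module can see the GPU(s) in your allocation is to query PyTorch directly from within an interactive or batch job on a GPU node (a minimal check; the values printed depend on what you requested):
```
module load ml/pytorch
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```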
## Multi-GPU jobs A sample multi-GPU PyTorch code can be found in the [Distributed PyTorch tutorial](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series) examples on github. The SLURM script provided in the pytorch examples folder can be adapted to run on our cluster. The SLURM script provided below runs the `multinode.py` pytorch script on four A40 GPU cards distributed over two nodes:
```
#!/bin/bash
#SBATCH --job-name=ddp_on_A40
#SBATCH --partition=es1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --account=
#SBATCH --time=01:00:00
#SBATCH --qos=es_normal
#SBATCH --gres=gpu:A40:2

module load ml/pytorch

cwd=$(pwd)   # directory containing multinode.py
allocated_nodes=$(scontrol show hostname $SLURM_JOB_NODELIST)
nodes=${allocated_nodes//$'\n'/ }
nodes_array=($nodes)
head_node=${nodes_array[0]}
echo Head Node: $head_node
echo Node List: $nodes

srun torchrun --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node:29500 \
    $cwd/multinode.py 500 10
```
# Tensorflow ## Loading Tensorflow
```
module load ml/tensorflow
```
Tensorflow versions Use `module spider tensorflow` to get information on the versions of tensorflow available as modules. `module load ml/tensorflow` will load other dependent modules such as `cuda`. If you use the jupyter server on [lrc-openondemand](../../../openondemand/overview/), tensorflow kernels `tf 2.15.0` and `tf 2.14.0` are available. ## Example SLURM script The following SLURM script shows how to run a tensorflow script on 1 H100 GPU card.
```
#!/bin/bash
#SBATCH --job-name="TensorFlowCIFAR10"
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --output=tf_job.out
#SBATCH --error=tf_job.err
#SBATCH --time=0:40:0
#SBATCH --partition=es1
#SBATCH --account=
#SBATCH --qos=es_normal
#SBATCH --gres=gpu:H100:1

module load ml/tensorflow
srun python cifar10.py
```
# Intel MPI ## Loading Intel MPI
```
[user@n0000 ~]$ module load intel-oneapi-compilers
[user@n0000 ~]$ module load intel-oneapi-mpi
[user@n0000 ~]$ module list
Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0   2) intel-oneapi-mpi/2021.10.0
```
## Compiling MPI applications with Intel MPI The Intel MPI compiler wrappers `mpiicx`, `mpiicpx`, `mpiifx` can be used to compile MPI applications. For hello world C/C++/Fortran examples: Examples
```
mpiicx -o helloc hello_world.c
```
`mpiicx` is the MPI wrapper to the Intel(R) C/C++ compiler and should be used to compile and link C programs.
```
mpiicpx -o hellocxx hello_world.cpp
```
`mpiicpx` is the MPI wrapper to the Intel(R) oneAPI DPC++/C++ Compiler and should be used to compile and link C++ programs.
```
mpiifx -o hellofortran hello_world.f90
```
`mpiifx` is the MPI wrapper to the Intel(R) oneAPI Fortran Compiler `ifx`. The `intel-oneapi-mpi` package also comes with MPI wrappers to the Intel Classic Compilers: `mpiicc`, `mpiicpc` and `mpiifort`. ## Running MPI applications using Intel MPI Intel MPI applications can be launched using: - `mpirun`, e.g.: `mpirun -np 2 ./helloc` - `srun`: To launch an Intel MPI application using `srun`, please set the `I_MPI_PMI_LIBRARY` environment variable and pass the `--mpi=pmi2` argument as follows in your slurm script (a complete job script sketch is shown below).
```
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 mpi_application
```
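Putting the pieces above together, a minimal Intel MPI batch script might look like the following sketch. The account, partition, QoS, task count, and executable name are placeholders to adapt to your own allocation:
```
#!/bin/bash
#SBATCH --job-name=intelmpi_hello
#SBATCH --account=account_name
#SBATCH --partition=lr7
#SBATCH --qos=lr_normal
#SBATCH --ntasks=8
#SBATCH --time=00:10:00

# Load the same compiler/MPI combination used to build the application
module load intel-oneapi-compilers intel-oneapi-mpi

# Launch through srun using the PMI2 interface, as described above
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 ./helloc
```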
# Open MPI ## Loading Open MPI Open MPI is installed for various compilers in the software module farm. A compiler must be loaded before you can load the corresponding `openmpi`. For example, if you load the default `gcc` through `module load gcc`, then you can see which openmpi modules are available under the `gcc` module via `module avail openmpi`:
```
[user@n0000 ~]$ module load gcc
[user@n0000 ~]$ module list
Currently Loaded Modules:
  1) gcc/11.4.0
[user@n0000 ~]$ module avail openmpi
-------------- /global/software/rocky-8.x86_64/modfiles/gcc/11.4.0 -----------
openmpi/4.1.3    openmpi/4.1.6 (D)
```
After this, you can load the default `openmpi/4.1.6` through `module load openmpi` or by specifying the version `module load openmpi/4.1.6`. If you want to load the non-default `openmpi/4.1.3` module, then you must specify the version: `module load openmpi/4.1.3`:
```
[user@n0000 ~]$ module load openmpi
[user@n0000 ~]$ module list
Currently Loaded Modules:
  1) gcc/11.4.0   2) ucx/1.14.1   3) openmpi/4.1.6
```
## Compiling MPI applications with Open MPI The Open MPI compiler wrappers `mpicc`, `mpicxx`, `mpifort` can be used to compile MPI applications. For hello world C/C++/Fortran examples: Examples
```
mpicc -o helloc hello_world.c
```
`mpicc` is the MPI wrapper to the gcc C compiler.
```
mpicxx -o hellocxx hello_world.cpp
```
`mpicxx` is the MPI wrapper to the gcc C++ compiler.
```
mpifort -o hellofortran hello_world.f90
```
`mpifort` is the MPI wrapper to the gfortran compiler. The `gcc/openmpi` compiled binaries can be launched directly through `srun` inside of a slurm job script. # Einsteinium GPU Cluster Einsteinium is an institutional GPU cluster that was deployed to meet the growing computational demand from researchers doing machine learning and deep learning. The system is named after the chemical element with symbol `Es` and atomic number 99, which was discovered at Lawrence Berkeley National Laboratory in 1952, and in honor of Albert Einstein, who developed the theory of relativity. ## `es1` Partition `es1` is a partition consisting of multiple GPU node types to address different research needs. These include:
| Accelerator | Nodes | GPUs per Node/GPU Memory | CPU Processor | CPU Cores | CPU RAM | Infiniband |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA H100 | 4 | 8x 80 GB | Intel Xeon Platinum 8480+ | 112 | 1 TB | NDR |
| NVIDIA A100 | 1 | 4x 80 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA A40 | 30 | 4x 48 GB | AMD EPYC 7742 | 64 | 512 GB | FDR |
| NVIDIA GRTX8000 | 1 | 4x 48 GB | AMD EPYC 7713 | 64 | 512 GB | HDR |
| NVIDIA V100 | 15 | 2x 32 GB | Intel Xeon E5-2623 | 8 | 64GB or 192GB | FDR |
H100 and CBORG Currently, we have five NVIDIA H100 nodes in our datacenter, four of which are available to users through SLURM on the `es1` partition. One H100 node (8 GPUs) is used for LLM inference by [CBORG](http://cborg.lbl.gov) . ### How to specify desired GPU card(s) Due to the hardware configuration, special attention is needed to ensure the correct ratio of CPU cores to GPUs. Examples: - Request one V100 card: `--cpus-per-task=4 --gres=gpu:V100:1 --ntasks=1` - Request two A40 cards: `--cpus-per-task=16 --gres=gpu:A40:2 --ntasks=2` - Request three H100 cards: `--cpus-per-task=14 --gres=gpu:H100:3 --ntasks=3` - Request one A100 card: `--cpus-per-task=16 --gres=gpu:A100:1 --ntasks=1` - Request four GRTX8000 cards: `--cpus-per-task=16 --gres=gpu:GRTX8000:4 --ntasks=4` Example slurm script on `es1` Here is an example slurm script that requests one NVIDIA A40 GPU card.
``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=es1 #SBATCH --qos=es_normal #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=16 #SBATCH --gres=gpu:A40:1 #SBATCH --time=1:00:00 module load ml/pytorch python train.py ``` ``` #!/bin/bash #SBATCH --job-name=test #SBATCH --account=account_name #SBATCH --partition=es1 #SBATCH --qos=es_normal #SBATCH --nodes=1 #SBATCH --ntasks=4 #SBATCH --cpus-per-task=14 #SBATCH --gres=gpu:H100:4 #SBATCH --time=1:00:00 module load ml/pytorch python train.py ``` ## `es0` Partition `es0` is a partition with NVIDIA 2080 TI GPUs that do not incur [Service Unit (SU) charges](../../#gpu-partitions-recharge-rates). | Accelerator | Nodes | GPUs per Node/GPU Memory | CPU Processor | CPU Cores | CPU RAM | Infiniband | | --- | --- | --- | --- | --- | --- | --- | | NVIDIA 2080TI | 12 | 4x 11 GB | Intel Xeon Silver 4212 | 8 | 96GB | FDR | Example slurm script on `es0` ``` #!/bin/bash #SBATCH --job-name=testes0 #SBATCH --account=account_name #SBATCH --partition=es0 #SBATCH --qos=es_normal #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=2 #SBATCH --gres=gpu:1 #SBATCH --time=1:00:00 module load ml/pytorch python train.py ``` ``` #!/bin/bash #SBATCH --job-name=testes0 #SBATCH --account=account_name #SBATCH --partition=es0 #SBATCH --qos=es_normal #SBATCH --nodes=1 #SBATCH --ntasks=4 #SBATCH --cpus-per-task=2 #SBATCH --gres=gpu:4 #SBATCH --time=1:00:00 module load ml/pytorch python train.py ``` # CPU Cluster ## Lawrencium About Lawrencium Cluster Lawrencium is a general purpose cluster that is suitable for running a wide variety of scientific applications. The system is named after the chemical element 103 which was discovered at Lawrence Berkeley National Laboratory in 1958 and in honor of Ernest Orlando Lawrence, the inventor of the cyclotron. The original Lawrencium system was built as a 200-node cluster and debuted as #500 on the Top500 supercomputing list in Nov 2008. Lawrencium consists of multiple generations of compute nodes with the `lr8` partition being the most recent addition and the `lr4` partition the oldest still in production. In addition, there is a `lr_bigmem` partition with 1.5TB memory per node, and `cm1, cm2, cf1` partitions (details in the table below). | Partition | Nodes | CPU | Cores | Memory | Infiniband | | --- | --- | --- | --- | --- | --- | | lr8 | 20 | AMD EPYC 9534 | 128 | 768GB | HDR | | lr7 | 132 | Intel Xeon Gold 6330 | 56 | 256GB or 512GB | HDR | | lr6 | 88 | Intel Xeon Gold 6130 | 32 | 96GB or 128GB | FDR | | | 156 | Intel Xeon Gold 5218 | 32 | 96GB | FDR | | | | Intel Xeon Gold 6230 | 40 | 128GB | FDR | | lr5 | 192 | Intel Xeon E5-2680v4 | 28 | 64GB | FDR | | | | Intel Xeon E5-2640v4 | 20 | 128GB | QDR | | lr4 | 148 | Intel Xeon E5-2670v3 | 24 | 64GB | FDR | | lr_bigmem | 2 | Intel Xeon Gold 5218 | 32 | 1.5TB | EDR | | cm1 | 14 | AMD EPYC 7401 | 48 | 256GB | FDR | | cm2 | 3 | AMD EPYC 7454 | 64 | 256GB | EDR | | cf1 | 72 | Intel Xeon Phi 7210 | 256 | 192GB | FDR | LRC Jobscript Generator You can use the [LRC Jobscript Generator](https://lbnl-science-it.github.io/lrc-jobscript/src/lrc-calculator.html) page to generate sample slurm job submission scripts targeting these different systems. # ALSACC - Advanced Light Source The ALSACC cluster is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage and backend network management. 
## Login and Data Transfer ALSACC uses One Time Password (OTP) for login authentication for all the services provided below. Please also refer to the Data Transfer page for additional information. - Login server: `lrc-login.lbl.gov` - DATA transfer server: `lrc-xfer.lbl.gov` - Globus Online endpoint: `lbnl#lrc` ## Hardware Configuration ALSACC cluster has a mixture of different CPU architectures and memory configurations so please be aware of them and choose them wisely along with the scheduler configurations. | Partition | Nodes | Node List | CPU | Cores | Memory | Infiniband | | --- | --- | --- | --- | --- | --- | --- | | alsacc | 64 | n00[00-27].alsacc0 | Intel Xeon X5650 | 12 | 24GB | QDR | | | | n00[28-43].alsacc0 | Intel Xeon E5-2670 | 16 | 64GB | FDR | | | | n00[44-55].alsacc0 | Intel Xeon E5-2670v2 | 20 | 64GB | FDR | | | | n00[56-63].alsacc0 | Intel Xeon E5-2670v3 | 24 | 64GB | FDR | ## Storage and Backup ALSACC cluster users are entitled to access the following storage systems so please get familiar with them. | Name | Location | Quota | Backup | Allocation | Description | | --- | --- | --- | --- | --- | --- | | HOME | `/global/home/users/$USER` | 12GB | Yes | Per User | HOME directory for permanent data storage | | GROUP-SW | `/global/home/groups-sw/$GROUP` | 200GB | Yes | Per Group | GROUP directory for software and data sharing with backup | | GROUP | `/global/home/groups/$GROUP` | 400GB | No | Per Group | GROUP directory for data sharing without backup | | SCRATCH | `/global/scratch/$USER` | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system | | CLUSTERFS | `/clusterfs/alsacc/$USER` | none | No | Per User | Private storage | Note HOME, GROUP, and GROUP-SW directories are located on a highly reliable enterprise level BlueArc storage device. Since this appliance also provides storage for many other mission critical file systems, and it is not designed for high performance applications, running large I/O dependent jobs on these file systems could greatly degrade the performance of all the file systems that are hosted on this device and affect hundreds of users, thus this behavior is explicitly prohibited. HPCS reserves the right to kill these jobs without notification once discovered. Jobs that have I/O requirement should use the SCRATCH file system which is designed specifically for that purpose. ## Scheduler Configuration ALSACC cluster uses [SLURM](../../../running/slurm-overview/) as the scheduler to manage jobs on the cluster. To use the ALSACC resource, the partition `alsacc` must be used (`--partition=alsacc`) along with account `alsacc` (`--account=alsacc`). Currently there is no special limitation introduced to the `alsacc` partition thus no QoS configuration is required to use the ALSACC resources (a default QoS will be applied automatically). A standard fair-share policy with a decay half life value of 14 days is enforced. The job allocation on ALSACC is shared i.e. a node can be shared between multiple jobs. The different QoS arguments and their limits are shown below: | Node List | Node Features | | --- | --- | | n00[00-27].alsacc0 | alsacc, alsacc_c12 | | n00[28-43].alsacc0 | alsacc, alsacc_c16 | | n00[44-55].alsacc0 | alsacc, alsacc_c20 | | n00[56-63].alsacc0 | alsacc, alsacc_c24 | ## Software Configuration ALSACC uses [Software Module Farm](../../../software/software-module-farm/) to [manage](../../../software/module-management/) the cluster-wide software installation. 
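Putting the ALSACC scheduler configuration described above together, a minimal job script header for this cluster might look like the following sketch (the wall time, node count, modules, and command to run are placeholders):
```
#!/bin/bash
#SBATCH --job-name=alsacc_test
#SBATCH --partition=alsacc
#SBATCH --account=alsacc
#SBATCH --nodes=1
#SBATCH --time=01:00:00
# No QoS needs to be specified on ALSACC; a default QoS is applied automatically.

module load gcc openmpi
srun ./my_application
```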
## Cluster Status Please visit [here](https://metacluster.lbl.gov/warewulf/alsacc0) for the live status of the ALSACC cluster. # CATAMOUNT - Material Sciences Division The CATAMOUNT cluster is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage, and backend network management. ## Login and Data Transfer CATAMOUNT uses One Time Password (OTP) for login authentication for all the services provided below. Please also refer to the Data Transfer page for additional information. - Login server: `lrc-login.lbl.gov` - DATA transfer server: `lrc-xfer.lbl.gov` - Globus Online endpoint: `lbnl#lrc` ## Hardware Configuration Each compute node has dual-socket octa-core Intel Xeon E5-2670 (Sandy Bridge) @ 2.60 GHz processors (16 cores in total) and 64 GB of physical memory. Compute nodes are connected to each other through multiple high-performance Mellanox 56 Gbps FDR Infiniband edge switches, which uplink to a backbone Mellanox SX6518 director switch. | Partition | Nodes | Node List | CPU | Cores | Memory | | --- | --- | --- | --- | --- | --- | | catamount | 116 | n0[000-115].catamount0 | Intel Xeon E5-2670 | 16 | 64GB | ## Storage and Backup CATAMOUNT cluster users are entitled to access the following storage systems, so please get familiar with them. | Name | Location | Quota | Backup | Allocation | Description | | --- | --- | --- | --- | --- | --- | | HOME | `/global/home/users/$USER` | 12GB | Yes | Per User | HOME directory for permanent data storage | | GROUP-SW | `/global/home/groups-sw/$GROUP` | 200GB | Yes | Per Group | GROUP directory for software and data sharing with backup | | GROUP | `/global/home/groups/$GROUP` | 400GB | No | Per Group | GROUP directory for data sharing without backup | | SCRATCH | `/global/scratch/$USER` | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system | | CLUSTERFS | `/clusterfs/catamount/$USER` | none | No | Per User | Private storage | Note HOME, GROUP, and GROUP-SW directories are located on a highly reliable, enterprise-level BlueArc storage device. This appliance also provides storage for many other mission-critical file systems and is not designed for high-performance workloads; running large I/O-intensive jobs on these file systems can severely degrade performance for every file system hosted on the device and affect hundreds of users, so this behavior is explicitly prohibited. HPCS reserves the right to kill such jobs without notification once discovered. Jobs with significant I/O requirements should use the SCRATCH file system, which is designed specifically for that purpose. ## Scheduler Configuration CATAMOUNT cluster uses [SLURM](../../../running/slurm-overview/) as the scheduler to manage jobs on the cluster. To use the CATAMOUNT resource, the partition `catamount` must be used (`--partition=catamount`) along with account `catamount` (`--account=catamount`). One of the QoSs from the following table should be used as well (e.g., `--qos=cm_short`). A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced. The job allocation on CATAMOUNT is exclusive, i.e., a node is **not** shared between two jobs.
The different QoS arguments and their limits are shown below: | QOS | QOS Limit | | --- | --- | | cm_short | 4 nodes max per job; 24:00:00 wallclock limit | | cm_medium | 16 nodes max per job; 72:00:00 wallclock limit | | cm_long | 32 nodes max per user | | cm_debug | 2 nodes max per job; 8 nodes in total; 00:30:00 wallclock limit | ## Software Configuration CATAMOUNT uses [Software Module Farm](../../../software/software-module-farm/) to [manage](../../../software/module-management/) the cluster-wide software installation. # Catscan The Catscan cluster is an xCAT stand-alone cluster. **Login node**: catscan.lbl.gov, which hosts a 269 TB ZFS filesystem for data and computational scratch space on `/pool0` (see below). **Compute nodes**: n0000, n0001, n0002, n0003 ## Cluster Configuration | Node | Access | Storage | Filesystems | Description of Use | CPU | CORES | MEMORY | GPU | | --- | --- | --- | --- | --- | --- | --- | --- | --- | | catscan.lbl.gov | ssh with either LDAP credentials or password provided by administrator | /home: local drive, 211G /pool0 ZFS filesystem, 269T | /clusterfs/bebb/users /clusterfs/bebb/group-sw | Login node | Intel(R) Xeon(R) Gold 6126 | 48 (HT Enabled) | 196 GB | N/A | | n000[0-1] | ssh from catscan.lbl.gov with cluster key | As above via nfs | As above via nfs | Compute node | Intel(R) Xeon(R) Gold 6126 | 48 (HT Enabled) | 196 GB | 4x NVIDIA GeForce GTX 1080 Ti | | n000[2-3] | ssh from catscan.lbl.gov with cluster key | As above via nfs | As above via nfs | Compute node | | 24 | 188 GB | 2x NVIDIA RTX A4500 | **Operating System**: CentOS Linux release 7.9.2009 (Core) **Nvidia driver & NVRM version**: NVIDIA UNIX x86_64 Kernel Module 510.73.05 Sat May 7 05:30:26 UTC 2022 Each of the n000[0-1] compute nodes has four GPUs, with the following device properties:
```
CUDA Driver Version / Runtime Version          10.1 / 10.0
CUDA Capability Major/Minor version number:    6.1
Total amount of global memory:                 11178 MBytes (11721506816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
GPU Max Clock rate:                            1582 MHz (1.58 GHz)
Memory Clock rate:                             5505 Mhz
Memory Bus Width:                              352-bit
L2 Cache Size:                                 2883584 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device supports Compute Preemption:            Yes
Supports Cooperative Kernel Launch:            Yes
Supports MultiDevice Co-op Kernel Launch:      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 94 / 0
Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
```
## Queue Configuration At the moment, there is no resource manager/scheduler. Depending on how the resources are used, this may change.
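Because there is no scheduler, it is worth confirming that a node's GPUs and CPUs are idle before starting work. A quick check using standard NVIDIA and Linux tools (assuming they are on the default path):

```
# Show per-GPU utilization and memory use on this node
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv

# Show the load average and who else is logged in
uptime
who
```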
## Additional Notes ### Authentication Authentication is currently via LDAP, using the same credentials you would use to access gmail.lbl.gov. We are in the process of evaluating OTP over ssh keys, or simply OTP. ### Accessing the Compute nodes You will need to generate ssh keys for intra-cluster access to the compute nodes. To do this run:
```
ssh-keygen -t ed25519
```
Then run
```
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
```
Then finally
```
chmod 600 ~/.ssh/authorized_keys
```
The suggested default key name and location should be fine. When prompted for a passphrase, leave it blank by pressing Enter; a passphrase can interfere with intra-cluster node communication when launching jobs, particularly with a scheduler should we choose to deploy one. ## Storage for Data Each user will be granted space under `/clusterfs/bebb/users` for data. There is also a group directory to which the group has write permission. This directory should be used for custom builds of software that the group may want to take advantage of. This is modeled after what we offer on our clusters in Building 50B-1275, but this is your cluster, so you can choose to use it how you desire. ## Software Module Farm See [Documentation on using "modules"](../../../software/module-management/). ### Apptainer (formerly known as Singularity) Some of you have expressed the need to import custom software built on different architectures. Instead of re-inventing this on the Catscan architecture, you can use Apptainer. Apptainer enables users to have full control of their environment; it can be used to package entire scientific workflows, software and libraries, and even data. To get started, see the [documentation on using Singularity on Savio, a UC Berkeley cluster that we manage](http://research-it.berkeley.edu/services/high-performance-computing/using-singularity-savio). # DIRAC1 Cluster The DIRAC1 cluster is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage and backend network management. ## Login and Data Transfer DIRAC1 uses [One Time Password (OTP)](../../../accounts/mfa/) for login authentication for all the services provided below. - [Login server](../../../accounts/loggingin/): `lrc-login.lbl.gov` - [DATA transfer server](../../../data-transfer-node/): `lrc-xfer.lbl.gov` - [Globus Online endpoint](../../../../data/globus/): `lbnl#lrc` ## Hardware Configuration DIRAC1 has a mixture of different types of hardware. Compute nodes are connected to each other through a high-performance Mellanox 56 Gbps FDR switch. | PARTITION | NODES | NODE LIST | CPU | CORES | MEMORY | | --- | --- | --- | --- | --- | --- | | dirac1 | 56 | n0[000-055].dirac1 | INTEL XEON E5-2670 v3 | 24 | 256GB | | dirac1 | 72 | n0[056-127].dirac1 | INTEL XEON E5-2650 v4 | 24 | 256GB | ## Storage and Backup DIRAC1 cluster users are entitled to access the following storage systems, so please get familiar with them.
| NAME | LOCATION | QUOTA | BACKUP | ALLOCATION | DESCRIPTION | | --- | --- | --- | --- | --- | --- | | HOME | /global/home/users/$USER | 12GB | Yes | Per User | HOME directory for permanent data storage | | GROUP-SW | /global/home/groups-sw/$GROUP | 200GB | Yes | Per Group | GROUP directory for software and data sharing with backup | | GROUP | /global/home/groups/$GROUP | 400GB | No | Per Group | GROUP directory for data sharing without backup | | SCRATCH | /global/scratch/users/$USER | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system | | CLUSTERFS | /clusterfs/dirac1/$USER | none | No | Per User | Private storage | Note HOME, GROUP, GROUP-SW, and CLUSTERFS directories are located on a highly reliable, enterprise-level BlueArc storage device. This appliance also provides storage for many other mission-critical file systems and is not designed for high-performance workloads; running large I/O-intensive jobs on these file systems can severely degrade performance for every file system hosted on the device and affect hundreds of users, so this behavior is explicitly prohibited. HPCS reserves the right to kill such jobs without notification once discovered. Jobs with significant I/O requirements should use the SCRATCH file system, which is designed specifically for that purpose. ## Scheduler Configuration DIRAC1 cluster uses [SLURM](../../../running/slurm-overview/) as the scheduler to manage jobs on the cluster. To use the DIRAC1 resource, the partition `dirac1` must be used (`--partition=dirac1`) along with account `dirac1` (`--account=dirac1`). A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced. For users from `ac_ciftgp` (Guggenheim Users), the default QOS is `dirac1_normal`. This is the high-priority QOS, which will preempt low-priority jobs when it needs their resources. For users from `ac_cifgres`, the default QOS is `dirac1_lowprio`. This is the low-priority QOS; when the system is busy and higher-priority jobs are pending, the scheduler will preempt jobs running with this QOS. `dirac1_highprio` is a special QoS that will preempt both `dirac1_lowprio` and `dirac1_normal`. | PARTITION | ACCOUNT | NODES | NODE LIST | NODE FEATURES | SHARED | QOS | QOS LIMIT | | --- | --- | --- | --- | --- | --- | --- | --- | | dirac1 | dirac1 | 128 | n0[000-127].dirac1 | dirac1 | Exclusive | dirac1_lowprio dirac1_normal dirac1_highprio | no limit no limit no limit | ## Software Configuration DIRAC1 uses the Software Module Farm (Environment Modules) to manage the cluster-wide software installation. ## Cluster Status Please visit [here](http://metacluster.lbl.gov/warewulf/dirac1) for the live status of the DIRAC1 cluster. ## Additional Information Please send tickets to hpcshelp@lbl.gov or email ScienceIT@lbl.gov for any inquiries or service requests. # ETNA Cluster The ETNA cluster is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage, and backend network management. ## Login and Data Transfer ETNA uses One Time Password (OTP) for login authentication for all the services provided below. Please also refer to the Data Transfer page for additional information.
- Login server: `lrc-login.lbl.gov` - DATA transfer server: `lrc-xfer.lbl.gov` - Globus Online endpoint: `lbnl#lrc` ## Hardware Configuration Each compute node has dual-socket 12-core INTEL Xeon E5-2670 v3 @ 2.30 GHz processors (24 cores in total) and 64 GB of physical memory. Compute nodes are connected to each other through a high performance Mellanox 56 Gbps FDR Infiniband fabric. | PARTITION | NODES | CPU | CORES | MEMORY | GPU | | --- | --- | --- | --- | --- | --- | | etna | 170 | INTEL XEON E5-2670 v3 | 24 | 64GB | – | | etna | 3 | INTEL XEON E5-2670 v3 | 24 | 64GB | Xeon Phi | | etna | 16 | INTEL XEON E5-2623 v3 | 8 | 64GB | K80, V100 | ## Storage and Backup ETNA cluster users are entitled to access the following storage systems, so please get familiar with them. | NAME | LOCATION | QUOTA | BACKUP | ALLOCATION | DESCRIPTION | | --- | --- | --- | --- | --- | --- | | HOME | `/global/home/users/$USER` | 12GB | Yes | Per User | HOME directory for permanent data storage | | SCRATCH | `/global/scratch/users/$USER` | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system over Infiniband | | MOTEL | `/clusterfs/vulcan/motel/$USER` | none | No | Per User | Long-term storage of bulk data | | MOTEL2 | `/clusterfs/vulcan/motel2/$USER` | none | No | Per User | Long-term storage of bulk data | | PSCRATCH | `/clusterfs/etna/pscratch/$USER` | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system over Infiniband | Note `HOME`, `MOTEL`, and `MOTEL2` directories are located on a highly reliable, enterprise-level BlueArc storage device. This appliance also provides storage for many other mission-critical file systems and is not designed for high-performance workloads; running large I/O-intensive jobs on these file systems can severely degrade performance for every file system hosted on the device and affect hundreds of users, so this behavior is explicitly prohibited. HPCS reserves the right to kill such jobs without notification once discovered. Jobs with significant I/O requirements should use the `SCRATCH` or `PSCRATCH` file systems, which are designed specifically for that purpose. ## Scheduler Configuration ETNA cluster uses SLURM as the scheduler to manage jobs on the cluster. To use the ETNA resource, the partition `etna` must be used (`--partition=etna`). Users of projects `nano` and `etna` are allowed to submit jobs to the ETNA cluster using either `--account=etna` or `--account=nano`; details on checking your slurm associations are [here](../../../running/slurm-overview/#slurm-association). For the GPU nodes, use `--partition=etna_gpu`. Currently there is no special limitation on the `etna` partition, so no QoS configuration is required to use the ETNA resources (a default `normal` QoS will be applied automatically). A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced.
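For reference, a minimal ETNA batch script following the scheduler configuration above might look like this sketch (the account, module names, and application are illustrative; use whichever of `etna` or `nano` matches your Slurm association):

```
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=etna
#SBATCH --account=nano
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --time=1:00:00

# Allocation on the etna partition is exclusive, so it is reasonable to use
# all 24 cores of the node (placeholder module names and executable)
module load gcc openmpi
mpirun ./my_app
```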
| PARTITION | NODES | NODE LIST | NODE FEATURES | SHARED | | --- | --- | --- | --- | --- | | etna | 175 | n0[000-174].etna0 | etna, etna_phi([172-174]) | Exclusive | | etna_gpu | 16 | n0[175-183,238,299-304].etna0 | etna_gpu, etna_k80, etna_v100, etna_v100_32g | Shared | | etna-shared | 5 | n0[188-192].etna0 | etna_share | Shared | | etna_c40 | 16 | n0[239-254].etna0 | etna_c40 | Exclusive | | etna_bigmem | 47 | n0[186-187,193-237].etna0 | etna_bigmem | Exclusive | ## Software Configuration ETNA uses [Software Module Farm](../../../software/software-module-farm/) to [manage](../../../software/module-management/) the cluster-wide software installation. ## Cluster Status Please visit [here](https://metacluster.lbl.gov/warewulf/etna0) for the live status of the ETNA cluster. # MHG Cluster The MHG cluster is part of the LBNL Supercluster and shares the same Supercluster infrastructure. This includes the system management software, software module farm, scheduler, storage, and backend network management. ## Login and Data Transfer MHG uses One Time Password (OTP) for login authentication for all the services provided below. Please also refer to the Data Transfer page for additional information. - Login server: `lrc-login.lbl.gov` - DATA transfer server: `lrc-xfer.lbl.gov` - Globus Online endpoint: `lbnl#lrc` ## Hardware Configuration The MHG cluster has a mixture of CPU architectures and memory configurations, so please review them and request nodes accordingly through the scheduler options. | PARTITION | NODES | NODE LIST | CPU | CORES | MEMORY | | --- | --- | --- | --- | --- | --- | | mhg | 72 | n0[030-036,041-055].mhg0 | AMD Opteron 6376 | 64 | 256 GB | | | | n0[037-040,082,084].mhg0 | AMD Opteron 6376 | 64 | 512 GB | | | | n0[056-081,083,085-101].mhg0 | AMD Opteron 6274 | 64 | 256 GB | ## Storage and Backup MHG cluster users are entitled to access the following storage systems, so please get familiar with them. | NAME | LOCATION | QUOTA | BACKUP | ALLOCATION | DESCRIPTION | | --- | --- | --- | --- | --- | --- | | HOME | `/global/home/users/$USER` | 12GB | Yes | Per User | HOME directory for permanent data storage | | GROUP-SW | `/global/home/groups-sw/$GROUP` | 200GB | Yes | Per Group | GROUP directory for software and data sharing with backup | | GROUP | `/global/home/groups/$GROUP` | 400GB | No | Per Group | GROUP directory for data sharing without backup | | SCRATCH | `/global/scratch/users/$USER` | none | No | Per User | SCRATCH directory with Lustre high performance parallel file system | | CLUSTERFS | `/clusterfs/mhg/$USER` | none | No | Per User | Private storage | | LOCAL | `/local/scratch/users/$USER` | none | No | Per User | Local scratch on each node | Note HOME, GROUP, and GROUP-SW directories are located on a highly reliable, enterprise-level BlueArc storage device. This appliance also provides storage for many other mission-critical file systems and is not designed for high-performance workloads; running large I/O-intensive jobs on these file systems can severely degrade performance for every file system hosted on the device and affect hundreds of users, so this behavior is explicitly prohibited. HPCS reserves the right to kill such jobs without notification once discovered. Jobs with significant I/O requirements should use the SCRATCH file system, which is designed specifically for that purpose.
## Scheduler Configuration MHG cluster uses [SLURM](https://it.lbl.gov/resource/hpc/for-users/hpc-documentation/running-jobs/) as the scheduler to manage jobs on the cluster. To use the MHG resource, the partition `mhg` must be used (`--partition=mhg`) along with account `mhg` (`--account=mhg`). Currently there is no special limitation on the `mhg` partition, so no QoS configuration is required to use the MHG resources (a default `normal` QoS will be applied automatically). A standard fair-share policy with a decay half-life of 14 days (2 weeks) is enforced. If a node feature (the `--constraint` option) is not specified, the default dispatch order is: `mhg_c4, mhg_c8, mhg_c32, mhg_c48, mhg_m256, mhg_m512`. | PARTITION | ACCOUNT | NODES | NODE LIST | NODE FEATURES | SHARED | QOS | QOS LIMIT | | --- | --- | --- | --- | --- | --- | --- | --- | | mhg | mhg | 72 | n0[030-036].mhg0 n0[037-039].mhg0 n0040.mhg0 n0[041-055].mhg0 n0[056-101].mhg0 | mhg, mhg_c64, mhg_m256 mhg, mhg_c64, mhg_m512 mhg, mhg_c64, mhg_m512, mhg_ssd mhg, mhg_c64, mhg_m256 mhg, mhg_c64, mhg_m256 | Yes | normal | no limit | ## Software Configuration MHG uses [Software Module Farm](../../../software/software-module-farm/) to [manage](../../../software/module-management/) the cluster-wide software installation. ## Cluster Status Please visit [here](https://metacluster.lbl.gov/warewulf/mhg) for the live status of the MHG cluster. ## Additional Information Please send tickets to hpcshelp@lbl.gov or email ScienceIT@lbl.gov for any inquiries or service requests. # Data Transfer # Using the Globus AWS S3 Connector In order to set up Globus to access an S3 bucket, you'll need to have an IAM access key ID and a secret key ready to go. Due to Globus's implementation of the connector, you can only add a single IAM access key ID and secret key to your Globus configuration; however, you'll have access to any buckets that the access key is configured to have access to. Please note that the first time you set up the S3 connector you'll have to go through various "consent" and "authorization" prompts, and those steps are not documented here. Giving consent is a standard part of the Globus process whereby you authorize Globus to perform additional privileged operations with the selected endpoint. If you've already given permissions to Globus for the S3 connector, you might not see the consent steps. Note This guide assumes you've already set up an S3 bucket and configured the IAM access permissions to that bucket. If you need help doing that, see the [AWS S3 documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example1.html) for more information. ## Select Globus S3 Endpoint - Log in with Globus at [https://globus.lbl.gov](https://globus.lbl.gov/). - Select "Endpoints" from the left navigation. - Enter "LBNL AWS S3 Collection" into the search textbox. - From the list of endpoints that appear, click on the "LBNL AWS S3 Collection" link. ## Setup Credentials - Click on the "Credentials" tab of the "LBNL AWS S3 Collection" endpoint page. - Here is where you register your AWS IAM access key ID and secret key with Globus. - After you've entered them, click the "Continue" button, and you'll be taken back to the full "Credentials" tab where you can see your saved AWS access credentials. - At this point you are set up to access the S3 buckets with Globus.
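Optionally, you can also confirm outside of Globus that the IAM access key pair has the intended bucket access, for example with the AWS CLI (the profile and bucket names below are placeholders):

```
# Store the IAM access key ID and secret key under a named profile
aws configure --profile globus-s3

# List the bucket contents to confirm the key can read the bucket
aws s3 ls s3://my-example-bucket --profile globus-s3
```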
Click the "Overview" tab, and then the "Open in File Manager" button to see the S3 buckets and the data that are available using your AWS credentials. ## Create Guest Collection from LBNL AWS S3 Collection - Click on the "Collections" tab of the "LBNL AWS S3 Collection" endpoint, and then click on the "Add a Guest Collection". - Enter in the path to the top level folder you want visible in your collection in the “Directory” field (or enter “/” to use all buckets available to those credentials in your collection). You can also click the “Browse” button to get a directory view and select the bucket or subfolder folder you want. - Enter a value for the “Display Name” field, and then click the “Create Collection” button. # Using the Globus Google Cloud Storage Connector These are the instructions for configuring Globus to access a Google Cloud Storage bucket using your regular LBL account credentials. For this connector, unlike the S3 connector, you don’t use an access token + secret key or a service account setup. You’ll authorize the Google Cloud Storage connector to use your LBL account and from that you’ll have access to any buckets that your LBL account has access to. Please note that for your first time setting up the Google Cloud Storage connector you’ll have to go through various “consent” and “authorization” prompts, and those steps are not documented here. Giving consent is a standard part of the Globus process whereby you authorize Globus to perform additional privileged operations with the selected endpoint. If you’ve already given permissions to Globus for the Google Cloud Storage connector, you might not see the consent steps. ## Select Globus Google Cloud Storage Endpoint - Login with Globus at - Select "Endpoints" from the left navigation. - Enter "LBNL Google Cloud" into the search textbox. - From the list of endpoints that appear, click on the "LBNL Google Cloud Storage Collection" link. ## Setup Credentials - Click on the "Credentials" tab of the "LBNL Google Cloud Storage Collection" endpoint page. - Here you authenticate your LBL credentials for the connector. - After you've authenticated your account, click the "Continue" button, and you'll be taken back to the full "Credentials" tab where you can see your active LBL account credentials. Note One important thing to note is that due to the required Globus configuration for the “LBNL Google Cloud Storage Collection” you will be unable to view the “File Manager” from the root of the main collection (you’ll see an error message if you try) so you must use a Guest Collection to view files in your buckets. At this point you are authenticated and ready to add one or more Guest Collections to access Google Cloud Storage buckets with Globus. ## Create Guest Collection from Google Cloud Storage Connector - Click on the "Collections" tab of the "LBNL Google Cloud Storage Connector" endpoint, and then click on the "Add a Guest Collection". - Enter in the name of the bucket and any pathing within that bucket that you want to use as the top-level folder of your collection in the “Directory” field. Due to the required Globus configuration of the “LBNL Google Cloud Storage Collection” you will be unable to “Browse” the directory, so you must enter an existing bucket name in the directory field. Enter any value for the “Display Name” field, and then click the “Create Collection” button. 
# Globus for Google Drive - Search for [LBNL Gdrive Access](https://globus.lbl.gov/file-manager/collections/37286b85-fa2d-41bd-8110-f3ed7df32d62/) (or use that link directly) and click on the endpoint. ## Creating a Guest Collection in Gdrive - Click on the "Collections" tab and authenticate if prompted by hitting the "Continue" button. Consent to allow Globus to manage the collection. - Click "Add a Guest Collection". - Enter the drive path you want to view and select a display name for the collection, then click "Create Collection". - Return to the Endpoint tab and search for your new collection by the Display Name you gave it. - Open the new collection in the File Manager, and you are ready to transfer files to or from your LBNL Gdrive Collection! ## Creating a Shared Collection in Gdrive - Find the "LBNL Gdrive Access" endpoint in the Globus UI. - Click on the Collections tab and click "Add a Guest Collection". - Browse to the folder in your Google Drive that you wish to share and give it a Display Name. Click "Create Collection". - Select Add Permissions - Share With. (Here the "/" is the root of the folder you browsed to when creating this share. In this example, it is "My Drive/Fernsler's Demo Share/" in user fernsler's LBNL Google Drive.) - Enter the name/email of a collaborator you wish to share with, and give them appropriate permissions. By default, read-only access to shares is active. Click "Add Permission" to create the share. Globus will send the user you are sharing the folder with an email, and they will be able to find the Collection you shared by its Display Name in the endpoint tab search (in this example, it is "Share with Wei"). # Globus for Lawrencium ## Login - Open a browser and navigate to the Globus web interface. - You may choose your institution from the drop-down list. If your institution is not listed, you may sign in using Google or an ORCID iD. Ideally, use the email associated with your Lawrencium user account to sign in to Globus. - If you choose LBNL as your organization, you will be directed to enter your LBNL credentials. Enter your username and password. - Enter your OTP (one-time password). ## Search Endpoint - Once authenticated, enter the endpoint name in the collection search bar. - For example, search for `lbnl#lrc` to find Lawrencium's GridFTP endpoint. - Scroll down to find the `lbnl#lrc` endpoint owned by kmfensler@lbl.gov. - When the endpoint is selected, it will appear as a collection in a file manager. By default, the Globus UI will display your `$HOME` directory. Type a file path in the navigation bar next to **Path** to navigate to a different directory (e.g., `/global/scratch/`). - Click on the three dots next to the collection to view its properties. Globus is a free data transfer and storage service that lets you efficiently, securely, reliably, and quickly move large amounts of data between different resources (e.g., a personal computer, the Lawrencium cluster, Google Drive, Cloud Storage, and others) and also share data on those resources with others. Globus addresses many of the common challenges faced by researchers in moving, sharing, and archiving large volumes of data. With Globus, you hand off data movement tasks to a hosted service that manages the entire operation, monitoring performance and errors, retrying failed transfers, correcting problems automatically whenever possible, and reporting status to keep you informed of the process. Science IT provides many Globus endpoints to help automate data transfers.
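For scripted or recurring transfers, the same endpoints can also be driven from the Globus CLI rather than the web UI; a minimal sketch, assuming the `globus-cli` package is installed (the endpoint UUIDs and paths are placeholders):

```
# Log in to Globus (opens a browser window for authentication)
globus login

# Look up the UUID of the Lawrencium endpoint
globus endpoint search "lbnl#lrc"

# Submit an asynchronous transfer of a single file between two endpoints
globus transfer SRC_ENDPOINT_UUID:/global/scratch/users/$USER/results.tar.gz \
    DST_ENDPOINT_UUID:/~/results.tar.gz --label "example transfer"
```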
Globus endpoints are authenticated against your active LBL account credentials; however, some endpoints like Lawrencium or Cloud Storage might require additional credentials or authentication methods to function properly. LBL's main Globus UI is available at [https://globus.lbl.gov](https://globus.lbl.gov/). Current managed endpoints with general availability are: | Name | Globus Endpoint Name | Documentation | | --- | --- | --- | | Google Drive | [LBNL Gdrive Access](https://globus.lbl.gov/file-manager/collections/37286b85-fa2d-41bd-8110-f3ed7df32d62/overview) | [Globus for Google Drive](../globus-google-drive/) | | Lawrencium | [lbnl#lrc](https://globus.lbl.gov/file-manager/collections/45afb626-a4bd-11e8-96f0-0a6d4e044368/overview) | [Globus for Lawrencium](../globus-instructions/) | | Amazon Web Services S3 | [LBNL AWS S3 Collection](https://globus.lbl.gov/file-manager/collections/9c6d5242-306a-4997-a69f-e79345086d68/overview) | [Using the Globus AWS S3 Connector](../globus-aws-s3-connector/) | | Google Cloud Storage | [LBNL Google Cloud Storage Collection](https://globus.lbl.gov/file-manager/collections/54047297-0b17-4dd9-ba50-ba1dc2063468/overview) | [Using the Globus Google Cloud Storage Connector](../globus-google-cloud-storage-connector/) | If you are interested in using Google Cloud or Amazon, please reach out to [scienceit@lbl.gov](mailto:scienceit@lbl.gov) for more information on setting up a GCP or AWS account. ## Setting up a Globus Connect Personal Endpoint Even when there is not an LBL-managed endpoint available, it can still be useful to have access to the transfer and retry features of Globus. You can do this by using Globus Connect Personal to configure an endpoint on your personal device. In general, it is always faster to use endpoints managed by LBL, but Globus Connect Personal can be useful for transfers to or from a local laptop or computer. You can find instructions for downloading and installing Globus Connect Personal on the [Globus web site](https://docs.globus.org/globus-connect-personal/). Globus is highly recommended for data transfers. It is available as a free service for any user to sign up. Please follow the [instructions](../globus-instructions/) for access setup. If you see a connector you would like us to support, please send email to [hpcshelp@lbl.gov](mailto:hpcshelp@lbl.gov). # Cloud # Cloud Services The IT Division encourages the use of cloud computing for science as an effective way to meet specific research and computing needs. Many researchers have found that cloud computing works well for quickly bringing up computing infrastructure and that it scales well for bursty workloads such as data releases. Others have found that cloud tools for machine learning, search, and workflow orchestration are very effective. To facilitate cloud use at Berkeley Lab, Science IT has secured master payer contracts with Amazon Web Services and Google Cloud Platform. These master payer agreements provide LBNL scientists, staff, and other affiliated researchers with access to the full suite of AWS and GCP services without the use of a purchase order or the need to create a contract vehicle. This is considered a self-managed cloud offering, as you are expected to manage your own usage of services in your chosen cloud environment. Some of the benefits of using cloud computing services through the Lab's Master Payer agreement are: - LBNL and DOE have negotiated better pricing and significant discounts for both AWS and GCP as compared to regular commercial list pricing.
- Cost savings and discounts are automatically applied as part of the program. - There are no fees or charges for having an account on either service; you only pay for the services and resources that are actually in use. - An 11% overhead fee is charged (procurement pass-through burden). - Charges are applied monthly via recharge using a standard Project ID (PID). - You have access to Account Management, Cloud Solution Architects, and subject matter experts through the Science IT Consultants, who can assist with how to use cloud services, optimize costs, and provide technical guidance on best practices, but do not provide an ongoing service model. - Science IT works with AWS and GCP to provide various trainings throughout the year and also works to provide access to experts for consultations. # Accessing Cloud Services The processes for getting a cloud account (AWS) and a project (Google Cloud) are different. Users requesting an AWS account will need to work with the ScienceIT cloud team to get an account, while users needing a Google Cloud project can create their own projects after completing an initial 30-45 minute discussion with the ScienceIT cloud team. # Requesting an AWS Account at LBNL This document outlines the process for LBNL users to request and access an Amazon Web Services (AWS) account. **Important:** LBNL users cannot self-register AWS accounts using their LBL identity. All AWS account requests must be submitted through the Science IT cloud team. ## Prerequisites - You must be an LBNL staff member to get an LBNL AWS account. Interns and external collaborators are not eligible for an AWS account through LBNL; however, an LBNL staff member can be the point of contact and can provide interns and external collaborators access to an AWS account through IAM logins. ## Requesting an AWS Account 1. **Email the Science IT Cloud Team:** - Send an email to `scienceit@lbl.gov` requesting an AWS account. 1. **Provide Necessary Information:** - In your email, include the following information: - Indicate whether you or someone else will be the "owner" and contact for the account. This information is needed by both the LBNL cloud team and Cybersecurity. - A Project ID for recharges. 1. **Account Creation:** - The Science IT cloud team will set up a time to meet with the account owner to create and configure the AWS account. - Once the account is created, MFA must be set up and enabled before the account can be used. ## Enabling Multi-Factor Authentication (MFA) After your AWS account is created, you **must** enable Multi-Factor Authentication (MFA) before you can access and use any AWS services and resources in the account. This is a security requirement for all LBNL AWS accounts. For MFA, you can use a physical hardware token, an authenticator app, or a passkey. 1. **Log in to the AWS Management Console:** - Use the credentials provided by the Science IT cloud team to log in to the AWS Management Console: [aws.amazon.com/console](https://aws.amazon.com/console). 1. **Navigate to Security Credentials:** - Select the drop-down indicated by the account name in the upper right corner of the page. - Click the "Security credentials" tab. 1. **Assign MFA device:** - In the "Multi-factor authentication (MFA)" section, click "Assign MFA device". 1. **Choose MFA device type - Authenticator app option:** - Select "Virtual MFA device" and click "Continue". 1. **Install an Authenticator App:** - You will need to install an authenticator app on your smartphone or computer.
Popular options include: - Google Authenticator - Authy - Microsoft Authenticator - Scan the QR code displayed on the AWS screen with your authenticator app, or manually enter the secret key. 1. **Enter MFA Codes:** - Enter two consecutive MFA codes generated by your authenticator app into the AWS console and click "Assign MFA". 1. **Cleanup:** - You will need to log out of the account, and then log back in using MFA, in order for AWS services and resources to become available. ## Important Notes - **Account Management:** The Science IT cloud team manages all LBNL AWS accounts. - **Security:** MFA is mandatory for the root user on all LBNL AWS accounts. - **Support:** For any questions or assistance, contact the Science IT cloud team at `scienceit@lbl.gov`. # Creating a Google Cloud Project This document outlines the steps for LBNL Google Cloud users to create a new Google Cloud project, ensuring proper billing, organization, and location settings. ## Prerequisites - You must use your LBNL email identity to create a Google Cloud project within the `lbl.gov` organization. - While any LBNL user can create a Google Cloud project, you must have access to the "LBNL" Billing Account in order to create and run resources and services within the project. - If you create a Google Cloud project **without** selecting a Billing Account, you won't be able to create any resources in your project; instead, you'll be prompted to attach your project to a Billing Account. - Access to the "LBNL" Billing Account is provided to users upon request and after a short discussion with the LBNL Cloud Team about the parameters and rules of using Google Cloud. ## Step-by-Step Instructions 1. **Navigate to the Google Cloud Console:** - Ensure you are logged into Google with your LBNL account. - Open your web browser and go to the Google Cloud Console: [console.cloud.google.com](https://console.cloud.google.com). 1. **Create a New Project:** - Click on the project selector dropdown near the upper left of the page (it usually displays "Select a project" or the name of another existing project). - Click the "New Project" button. 1. **Project Details:** - **Project Name:** Enter a descriptive name for your project. Choose a name that clearly identifies the purpose of the project. - **Billing Account:** - Click anywhere in the Billing Account box to open the drop-down list. - Select the "LBNL" billing account. This ensures your project's costs are billed directly to LBNL for recharge against your Project ID. - **Organization:** - Ensure the "lbl.gov" organization is displayed in the list. As long as you're logged in with your LBNL account, this will be filled in for you automatically. This ensures your project is associated with LBNL's Google Cloud organization. - **Location:** - Click the "Browse" button. - Navigate down one level from the root "lbl.gov" organization, and select the folder named "Common". This is where most standard LBNL projects should be created. 1. **Create the Project:** - Review the project details to ensure they are correct. - Click the "Create" button. 1. **Project Creation and Selection:** - Google Cloud will begin creating your project. This process may take a few moments. - Once the project is created, you will be automatically redirected to the project's dashboard. If not, you can select your newly created project from the project selector dropdown.
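If you prefer the command line, the same setup can be sketched with the Google Cloud CLI; the project ID, folder ID, and billing account ID below are placeholders you would look up in the console, and the billing command may require a recent gcloud release:

```
# Create the project under the "Common" folder (numeric folder ID is a placeholder)
gcloud projects create my-lbl-project-id --name="My LBL Project" --folder=123456789012

# Link the project to the LBNL billing account (billing account ID is a placeholder)
gcloud billing projects link my-lbl-project-id --billing-account=000000-AAAAAA-111111
```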
## Screenshots - **Initial View** - **"Browse" Location and select the "Common" folder** - **Completed, with the location set to "Common"** ## Important Notes - **Project Naming Conventions:** Use a meaningful name that clearly indicates the purpose of the project, to maintain consistency. - **Billing Account:** Using the correct "LBNL" billing account is crucial for proper cost tracking and management. - **Organization and Location:** Selecting the "lbl.gov" organization and the "Common" folder ensures your project adheres to LBNL's organizational structure and resource management policies. - **Permissions:** If you encounter issues creating a project, contact [ScienceIT@lbl.gov](mailto:scienceit@lbl.gov) to verify your permissions and access.