HPC Environment Setup
HPC-part-3
⚠️ Pre-requisites: Before reading this post, you may want to check HPC Data Management for more background and motivation.
Motivation
As mentioned in the previous post, the HPC clusters limit the number of inodes (files and directories) on each filesystem. This causes problems when we want to use conda to manage project environments, because conda packages usually contain a lot of files.
The solution recommended by the HPC team is to use Singularity containers together with Miniconda. This post is largely adapted from that solution.
❓ What is Singularity?
Singularity is a free, cross-platform and open-source program that creates and executes containers on the HPC clusters. Containers are streamlined, virtualized environments for specific programs or packages. Singularity is an industry standard tool to utilize containers in HPC environments.
Quoted from NYU HPC
Tl;dr:
Singularity is a tool to create and run containers. An environment created with Singularity lives inside a single file, so we don't need to worry about the inode limit.
More information about Singularity can be found here.
🎬 Getting started
The following steps show how to install a conda environment with Singularity. We use PyTorch as an example, but the same steps apply to other packages.
Step 0: De-initialize conda
If you have initialized conda before (most likely because you installed conda yourself at some point), your prompt on Greene may show something like (base) [NETID@log-1 ~]$. In this case, you must first comment out or remove this block from your ~/.bashrc file:
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/share/apps/anaconda3/2020.07/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/share/apps/anaconda3/2020.07/etc/profile.d/conda.sh" ]; then
. "/share/apps/anaconda3/2020.07/etc/profile.d/conda.sh"
else
export PATH="/share/apps/anaconda3/2020.07/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
Step 1: Submit an interactive job
As the login nodes restrict memory to 2GB per user, we need to submit an interactive job to get more memory (details about interactive jobs will be covered in future posts). Run the command below:
# Request 4 cores and 16GB memory
srun -c4 --mem=16GB --pty /bin/bash
Wait until you are assigned a node. This usually takes less than a minute, but depending on the load of the cluster, it may take longer.
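Once the prompt comes back, you can confirm that you have landed on a compute node rather than a login node:
# The hostname should no longer be a login node such as log-1
hostname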
Step 2: Prepare overlay image
We will install conda and all packages in an overlay image, so we first need to prepare an empty one.
I recommend creating a folder in your /scratch directory to hold all of your environments. For example, run the following commands:
mkdir /scratch/$USER/envs
cd /scratch/$USER/envs
Like a filesystem, each overlay image has limits on its total size and on the number of inodes it can hold. The HPC team provides gzipped empty overlay images in several configurations. You can list the available images by running:
# Optional
ls /scratch/work/public/overlay-fs-ext3
In this example, we will use overlay-15GB-500K.ext3.gz. It has 15GB of space and 500K inodes, which should be enough for most projects. Copy the image to your environment folder and unzip it:
cp -rp /scratch/work/public/overlay-fs-ext3/overlay-15GB-500K.ext3.gz .
gunzip overlay-15GB-500K.ext3.gz
This may take about a minute. When it finishes, rename the image after your project:
mv overlay-15GB-500K.ext3 pytorch-example.ext3
Step 3: Install conda
Next, choose a Singularity image to launch. For all available images on Greene, check the folder /scratch/work/public/singularity/. Pay extra attention to the versions of CUDA, cuDNN, and Ubuntu.
In this example, we will use an up-to-date container with CUDA 11.8, cuDNN 8.7, and Ubuntu 22.04. Launch the container by running:
singularity exec --overlay pytorch-example.ext3:rw /scratch/work/public/singularity/cuda11.8.86-cudnn8.7-devel-ubuntu22.04.2.sif /bin/bash
Now you are inside the container; your prompt should show Singularity>. Download and install Miniconda to /ext3/miniconda3 (pay attention to the -p flag, otherwise it will be installed to your home directory):
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
Next, we need a wrapper script to activate the conda environment. If you haven't downloaded the slurm-template-script repository yet, clone it into your archive directory:
cd /archive/$USER
git clone https://github.com/Gaaaavin/slurm-template-script.git
Then copy the wrapper into the overlay image and source it:
cp /archive/$USER/slurm-template-script/env.sh /ext3/env.sh
source /ext3/env.sh
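For reference, the wrapper is just a short script that puts the Miniconda installation in the overlay on your PATH and sets up conda. Its contents are roughly along these lines (an illustrative sketch only; check the actual env.sh in the repository):
#!/bin/bash
# Illustrative sketch only: the real env.sh in the template repository may differ
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH
After sourcing it, which python should point somewhere under /ext3/miniconda3, confirming that you are using the conda installation inside the overlay rather than a system Python.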
Step 4: Check Python version
Now that your conda environment is activated, check whether the Python version in the base environment is what you want:
python --version
If it is what you want, or you don't care about the exact version, great.
If you want a newer version, you can simply update it by running:
conda install python=3.11
If you want an older version, you have to create a new environment. For example, if you want python 3.6, you can run:
conda create -n pytorch-example python=3.6
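If you do create a separate environment like this, remember to activate it before installing packages in the next step:
conda activate pytorch-example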
Step 5: Install packages
Now you can install packages as usual. For example, to install PyTorch, run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure that the CUDA version in the URL (cu118 here) matches the CUDA version of the container.
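As an optional sanity check, you can verify that PyTorch is installed and was built against the expected CUDA version (note that torch.cuda.is_available() will only return True once you are on a GPU node):
# Should print the PyTorch version and the CUDA version it was built with
python -c "import torch; print(torch.__version__, torch.version.cuda)"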
In each individual case, please follow the instructions from your project to install the required packages.
Step 6: End
After you have installed all the packages you need, simply exit the container by running exit. When you are done, you can leave the interactive job the same way.
🚀 All-in-one script
Once you are comfortable with the individual steps, you can combine Steps 2 to 5 into a single script. Below is a rough, untested sketch using the example values from this post; the environment name and image choices are placeholders, so adjust them for your project, run the script from inside an interactive job as in Step 1, and make sure the slurm-template-script repository is already cloned as in Step 3.
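#!/bin/bash
# Rough sketch only: combines Steps 2-5 with the example values used in this post
set -e

ENV_NAME=pytorch-example    # placeholder project name
ENV_DIR=/scratch/$USER/envs
SIF=/scratch/work/public/singularity/cuda11.8.86-cudnn8.7-devel-ubuntu22.04.2.sif

# Step 2: prepare the overlay image
mkdir -p $ENV_DIR
cd $ENV_DIR
cp -rp /scratch/work/public/overlay-fs-ext3/overlay-15GB-500K.ext3.gz .
gunzip overlay-15GB-500K.ext3.gz
mv overlay-15GB-500K.ext3 $ENV_NAME.ext3

# Steps 3-5: install Miniconda, the env.sh wrapper, and PyTorch inside the overlay
singularity exec --overlay $ENV_DIR/$ENV_NAME.ext3:rw $SIF /bin/bash -c "
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
    cp /archive/$USER/slurm-template-script/env.sh /ext3/env.sh
    source /ext3/env.sh
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
"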
⏭️ Next step
Since all packages are installed inside the overlay image, we also need to run our code inside the container when we submit a job. This will be covered in a future post.
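As a rough preview (assuming the paths from this post; my_script.py is just a placeholder), a job will run your code with something along these lines, where :ro mounts the overlay read-only so that several jobs can share it:
singularity exec --overlay /scratch/$USER/envs/pytorch-example.ext3:ro \
    /scratch/work/public/singularity/cuda11.8.86-cudnn8.7-devel-ubuntu22.04.2.sif \
    /bin/bash -c "source /ext3/env.sh; python my_script.py"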
Related posts
- Previous: HPC Part 2: Data management