HPC Data Management
HPC-part-2
Overview
This post introduces the data management system on HPC. Its main purpose is to lay a foundation for future posts about setting up environments, managing data, and running jobs on HPC.
There are four file systems on HPC:
- Home directory: /home/<NetID>
- Scratch directory: /scratch/<NetID>
- Archive directory: /archive/<NetID>
- Vast directory: /vast/<NetID>
Each has a different purpose and different limits. After you log in to HPC, you can use the myquota command to check your quota and usage on each file system. A sample output looks like this:
$ myquota
Hostname: log-1 at Sun Mar 21 21:59:08 EDT 2021
Filesystem  Environment  Backed up?  Allocation     Current Usage
Space       Variable     /Flushed?   Space / Files  Space(%) / Files(%)
/home       $HOME        Yes/No      50.0GB/30.0K   8.96GB(17.91%)/33000(110.00%)
/scratch    $SCRATCH     No/Yes      5.0TB/1.0M     811.09GB(15.84%)/2437(0.24%)
/archive    $ARCHIVE     Yes/No      2.0TB/20.0K    0.00GB(0.00%)/1(0.00%)
/vast       $VAST        No/Yes      2.0TB/5.0M     0.00GB(0.00%)/1(0.00%)
Note: As you can see, each file system has two kinds of limits: space and files (also known as inodes). The former is the total amount of space you can use on the file system, and the latter is the total number of files you can store on it.
The file-count limit is part of the reason why we need all these best practices in the first place.
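Because the file-count limit is easy to hit without noticing, it can help to see where the files actually are. The following is a minimal sketch, not part of the cluster's tooling: `count_inodes` and `inode_report` are hypothetical helper names, and the sketch assumes the quota counts directories as well as regular files.

```shell
#!/usr/bin/env bash

# count_inodes DIR
# Print the number of entries (files, directories, symlinks) below DIR.
# Assumption: the "Files" column of myquota counts directories too.
count_inodes() {
    find "$1" -mindepth 1 | wc -l
}

# inode_report DIR
# For each immediate subdirectory of DIR (hidden ones included), print
# "<count><TAB><path>", largest first. Each count includes the
# subdirectory itself.
inode_report() {
    local dir="${1:-.}" sub
    find "$dir" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r sub; do
        printf '%s\t%s\n' "$(find "$sub" | wc -l)" "$sub"
    done | sort -rn
}

# Example usage (uncomment on the cluster):
# count_inodes "$HOME"
# inode_report "$HOME" | head
```

The report includes hidden directories on purpose, since caches and package directories under the home directory are a common source of many small files.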
In the following sections, we will go through each file system in detail.
🏠 Home directory
The home directory is the default directory when you log in to HPC. As shown in the output above, its 50GB/30K limit is quite small. Therefore, storing anything here is not recommended.
📝 Scratch directory
The scratch directory is where you will do most of your work. It is a temporary storage space for your data and jobs. The 5TB/1M limit is enough to store almost everything you need.
Note that this directory is flushed, meaning any inactive files will be deleted after 60 days. When some of your files are about to be flushed, you will receive an email notification.
However, the 1M file limit is still relatively small, especially for modern datasets, which often contain a large number of small files. We will cover this in more detail in the section on the /vast directory.
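To check which scratch files are approaching the 60-day flush before the notification email arrives, something like the sketch below may help. It assumes that "inactive" is judged by last access time, which the post does not confirm, and `list_stale_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# list_stale_files DIR DAYS
# Print regular files under DIR whose last access is more than DAYS days old.
# Assumption: the flush policy is based on access time (atime).
# Note: find counts whole 24-hour periods, so "+50" matches files that are
# at least 51 days old.
list_stale_files() {
    find "$1" -type f -atime +"$2"
}

# Example usage (uncomment on the cluster): files approaching the 60-day flush.
# list_stale_files "$SCRATCH" 50
```

If the cluster tracks modification time instead, replacing `-atime` with `-mtime` gives the equivalent check.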
📦 Archive directory
Like the home directory, the archive directory is also a permanent storage space. However, it is not accessible from the compute nodes. Therefore, it is recommended to use this directory only for archival purposes.
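The sample output above shows a 20.0K file limit on /archive, so a project with many files is better stored as a single tarball than copied file by file. The sketch below shows one way to do that; `bundle_dir` and `my_project` are hypothetical names, and because the archive is not reachable from compute nodes, it is assumed to run on a login node.

```shell
#!/usr/bin/env bash

# bundle_dir SRC_DIR DEST_TARBALL
# Pack SRC_DIR into a single gzip-compressed tarball at DEST_TARBALL.
# Paths inside the tarball are relative to SRC_DIR's parent directory.
bundle_dir() {
    local src="$1" dest="$2"
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example usage (uncomment on a login node; my_project is a hypothetical name):
# bundle_dir "$SCRATCH/my_project" "$ARCHIVE/my_project.tar.gz"
# tar -tzf "$ARCHIVE/my_project.tar.gz" | head   # list contents to verify
```

The whole project then counts as one file against the archive quota, however many files it contains.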
📁 Vast directory
The vast directory is an all-flash file system that is optimized for computational workloads with high I/O rates.
As mentioned above, the vast directory has a much larger inode limit (5M files), so it is the recommended place for datasets that contain a large number of small files.
We will discuss best practices for handling large numbers of small files in future posts.
Related posts
- Previous: HPC Part 1: Access to HPC
- Next: HPC Part 3: HPC Environment Setup