HPC Data Management



This post introduces the data management system on HPC. Its main purpose is to provide a foundation for future posts about setting up environments, managing data, and running jobs on HPC.

There are four file systems on HPC:

  • Home directory: /home/<NetID>
  • Scratch directory: /scratch/<NetID>
  • Archive directory: /archive/<NetID>
  • Vast directory: /vast/<NetID>

Each of them has a different purpose and different limits. After you log in to HPC, you can use the myquota command to check your quota and usage on each file system. A sample output looks like this:

$ myquota
Hostname: log-1 at Sun Mar 21 21:59:08 EDT 2021
Filesystem   Environment   Backed up?   Allocation       Current Usage
Space        Variable      /Flushed?    Space / Files    Space(%) / Files(%)
/home        $HOME         Yes/No       50.0GB/30.0K       8.96GB(17.91%)/33000(110.00%)
/scratch     $SCRATCH      No/Yes        5.0TB/1.0M        811.09GB(15.84%)/2437(0.24%)
/archive     $ARCHIVE      Yes/No        2.0TB/20.0K       0.00GB(0.00%)/1(0.00%)
/vast        $VAST         No/Yes        2.0TB/5.0M        0.00GB(0.00%)/1(0.00%)

Note: As you can see, each file system has two kinds of limits: space and files (also known as inodes). The former is the total amount of space you can use on the file system, and the latter is the total number of files you can store on it. In the sample above, /home uses only 17.91% of its space but is already at 110% of its file limit.

The file-count limit is part of the reason why we need all these best practices in the first place.
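To see where a file-count quota is going, it can help to count entries per directory. The following is a minimal sketch, not an official tool: the function names count_entries and entries_per_subdir are made up for this post, and it assumes that every file, directory, and symlink is charged as one inode, which may not match the cluster's exact accounting.

```shell
#!/usr/bin/env bash

# Count every entry (files, directories, symlinks) below a directory.
# Assumption: each entry costs one inode against the file quota.
# File names containing newlines would be over-counted.
count_entries() {
    local dir="${1:-.}"
    find "$dir" -mindepth 1 | wc -l | tr -d ' '
}

# Print "<count><TAB><path>" for each immediate subdirectory
# (hidden ones included), largest count first.
entries_per_subdir() {
    local dir="${1:-.}"
    find "$dir" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r sub; do
        printf '%s\t%s\n' "$(count_entries "$sub")" "$sub"
    done | sort -rn
}

# Example on the cluster (not run here):
#   entries_per_subdir "$HOME" | head
```

Hidden directories are included on purpose, since caches and package environments often live there and can hold many small files.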

In the following sections, we will go through each file system in detail.

🏠 Home directory

The home directory is the default directory when you log in to HPC. As shown in the output above, its 50GB/30K limit is quite small. Therefore, it is not recommended to store your data here.

📝 Scratch directory

The scratch directory is where you will do most of your work. It is a temporary storage space for your data and jobs. The 5TB/1M limit is enough to store almost everything you need.

Note that this directory is flushed, meaning that files that have been inactive for 60 days are deleted. When some of your files are about to be flushed, you will receive an email notification.

However, the 1M file limit is still relatively small, especially for modern datasets, which often contain a large number of small files. We will cover this in more detail in the section on the /vast directory.
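Because scratch is flushed after 60 days of inactivity, it can be useful to list files that are getting close to that age before the notification email arrives. The sketch below is only an approximation: stale_files is a hypothetical helper name, the 50-day default is an arbitrary safety margin, and it assumes "inactive" is measured by last access time (atime), which the actual flush policy may or may not use.

```shell
#!/usr/bin/env bash

# List regular files under a directory that were last accessed more
# than N days ago (default 50).
# Assumption: inactivity is judged by access time; the real purge
# policy on the cluster may look at a different timestamp.
stale_files() {
    local dir="$1"
    local days="${2:-50}"
    find "$dir" -type f -atime +"$days"
}

# Example on the cluster (not run here):
#   stale_files "$SCRATCH" 50 | head
```

Anything in that list that you still need should be copied somewhere that is not flushed, such as the archive directory.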

📦 Archive directory

Like the home directory, the archive directory is a permanent storage space. However, it is not accessible from the compute nodes. Therefore, it’s recommended to use this directory only for archival purposes.
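Since the archive quota shown earlier allows only 20K files, one common way to archive a finished project is to pack it into a single tarball first. This is a sketch under assumptions rather than a prescribed workflow: archive_dir is a hypothetical helper, and the results directory in the example is a made-up name.

```shell
#!/usr/bin/env bash

# Pack a directory into one compressed tarball inside a destination
# directory, so that many small files count as a single file there.
archive_dir() {
    local src="$1"
    local dest="$2"
    local name
    name="$(basename "$src")"
    # -C makes the paths inside the tarball relative to the parent of src.
    tar -czf "$dest/$name.tar.gz" -C "$(dirname "$src")" "$name" || return 1
    # Check that the archive can be listed before anything is deleted.
    tar -tzf "$dest/$name.tar.gz" > /dev/null
}

# Example on a login node (hypothetical directory name, not run here):
#   archive_dir "$SCRATCH/results" "$ARCHIVE"
```

Because the archive directory is not visible from the compute nodes, a command like this would have to run on a node that can see both file systems.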

📁 Vast directory

The vast directory is an all-flash file system that is optimized for computational workloads with high I/O rates.

As mentioned above, the vast directory has a much larger inode limit (5M files), so it is recommended to store datasets that contain a large number of small files here.

We will discuss best practices for handling large numbers of small files in future posts.

Xinhao Liu
Ph.D. student in Computer Science

Extraordinarily ordinary.