
Data Storage at UChicago

Storage Limits

| Filesystem | Quota  | Path         | Backed up? | Notes                                                   |
| ---------- | ------ | ------------ | ---------- | ------------------------------------------------------- |
| $home      | 100 GB | /home/$user  | Yes        | Solid-state filesystem, shared to all worker nodes      |
| $data      | 5 TB   | /data/$user  | No         | CephFS filesystem, shared to all worker nodes           |
| $scratch   | n/a    | /scratch     | No         | Ephemeral storage for workloads, local to worker nodes  |

Filesystems

The UChicago analysis facility has three filesystems, each with a clearly defined role. Please keep these roles in mind when running workloads.

| Filesystem | Function |
| ---------- | -------- |
| $home      | Your home area is intended for small files such as analysis code, scripts, and small samples. Please store your large data files on the $data filesystem. |
| $data      | The dedicated shared filesystem for storing large files, for example your data samples. |
| $scratch   | Ephemeral storage for workloads, local to the worker nodes. All jobs start in this directory on the worker nodes by default, so output data must be staged back to a shared filesystem before the job ends or it will be lost. The next sections give examples and more details about this directory and its use. |

Which Filesystem to Use

Your account on the UChicago Analysis Facility is allocated space on the $HOME and $DATA filesystems when it is approved. By default, each user receives 100 GB of storage on the $HOME filesystem (i.e., /home/<username>). This filesystem is backed up weekly and is optimized for small files such as job logs, submit files, and code.

Data files, on the other hand, should be placed in the $DATA filesystem (/data/<username>), which is better suited to large files but is not backed up. By default, you will be given a 5 TB quota on this filesystem. If you are working on an analysis that requires more local storage, please contact the UChicago facility team and we'll do our best to accommodate your request.
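If you want to check how much of your allocation you are currently using, a standard disk-usage check is enough. A minimal sketch, assuming you are logged in to a login node and your username is available in $USER:

```bash
# Rough check of current usage on each filesystem
# (replace $USER with your username if it is not set).
du -sh /home/$USER    # default quota: 100 GB
du -sh /data/$USER    # default quota: 5 TB
```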

Finally, jobs submitted to HTCondor are assigned to a server with one or more dedicated high-speed "scratch" disks, each with a few TB of available space, to use as working space while your jobs run. This is important for two reasons:

  • The scratch disk is local. We strongly recommend that you copy your input data from $DATA to $SCRATCH when running a job because this will generally deliver the best performance.
  • The scratch disk is ephemeral. HTCondor will automatically clean up the scratch disk for the next workload when your job has finished using it.

Don't lose your data!

With this in mind, you will need to make sure that you copy your output data off the $SCRATCH filesystem and into the $DATA filesystem at the end of your job. Any data left in $SCRATCH will be lost when your job ends! When submitting jobs, you should use the scratch disk whenever possible. This helps you be a "good neighbor" to other users on the system and reduces load on the shared filesystems, since heavy load there can lead to slowness, downtime, and so on.
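As an illustration, the executable that HTCondor runs on the worker node can stage data in and out along the following lines. This is only a sketch: the sample, script, and output names (input.root, run_analysis.py, output.root) are hypothetical placeholders, and the job is assumed to start in its scratch working directory as described above.

```bash
#!/bin/bash
# Sketch of an HTCondor job script that uses the local scratch disk.
# Replace $USER with your username if the variable is not propagated to the job.
set -e

# Stage input from the shared $DATA area onto the fast local scratch disk
# (the job starts in its scratch working directory by default).
cp /data/$USER/samples/input.root .

# Run the analysis against the local copy (hypothetical command).
python run_analysis.py input.root --out output.root

# Stage the output back to $DATA before the job ends;
# anything left in scratch is cleaned up when the job finishes.
mkdir -p /data/$USER/results
cp output.root /data/$USER/results/
```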

To summarize, please review the following table:

| Filesystem | Best for ...                  | Backup | Access from ...     | Default Quota |
| ---------- | ----------------------------- | ------ | ------------------- | ------------- |
| $HOME      | Code, logs, submit files      | Weekly | SSH, Jupyter, jobs  | 100 GB        |
| $DATA      | ROOT files, other large files | None   | SSH, Jupyter, jobs  | 5 TB          |
| $SCRATCH   | Transient job files           | None   | Jobs only           | No quota      |

LOCALGROUPDISK

If you need more space to store data, or need to share it with teammates or colleagues who are not necessarily using the UChicago Analysis Facility, you can use LOCALGROUPDISK, a disk resource available to all US-ATLAS members. To check your quota, open the Rucio RSE "manage quota" page and type MWT2_UC_LOCALGROUPDISK in the text box:

(Screenshot: Rucio RSE "manage quota" page)

and click the select button. If you search for your lxplus username, you'll see that you have a default quota of 15 TB. If you need additional space beyond 15 TB, here is the Request form. Remember that your grid certificate needs US-ATLAS VO membership.
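If you prefer the command line, the Rucio client can show the same quota information. A minimal sketch, assuming an ATLAS software environment where setupATLAS/lsetup are available and that your Rucio account matches your lxplus username:

```bash
# Set up the Rucio client and a VOMS proxy (ATLAS environment assumed).
setupATLAS
lsetup rucio
voms-proxy-init -voms atlas

# List your usage and limits per RSE, including MWT2_UC_LOCALGROUPDISK.
rucio list-account-usage <your_lxplus_username>
```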

If you are an ATLAS member but can't find your name, go to the VOMS page and select /atlas/usatlas in the "groups roles" box:

(Screenshot: VOMS "groups roles" selection)

Transfer datasets to LOCALGROUPDISK

To transfer datasets to LOCALGROUPDISK, you have the following three options:

  • Using R2D2: make your request on the Rucio R2D2 request page.
  • Adding the --destSE option to your PanDA job.
  • Via Rucio on the command line (see the sketch below).
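For the command-line option, the usual approach is to create a replication rule that places one copy of your dataset on the LOCALGROUPDISK endpoint. A sketch, assuming the same Rucio client setup and VOMS proxy as above; the dataset name is a hypothetical example:

```bash
# Ask Rucio for one replica of the dataset on the UChicago LOCALGROUPDISK.
rucio add-rule mc16_13TeV:my.example.dataset_v1 1 MWT2_UC_LOCALGROUPDISK

# Check the status of your replication rules.
rucio list-rules --account <your_lxplus_username>
```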

How to access datasets

To access datasets, you can choose one of the following three options:

  • In grid-based analyses
  • Through XRootD from shared Tier-3s; check the Data Sharing section.
  • Downloading locally; remember to use the proper filesystem, e.g. store large data sample files in $data (see the example below).
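For the local-download option, the Rucio client can fetch a dataset directly into your $data area. Again a sketch, with a hypothetical dataset name and the same ATLAS/Rucio setup and proxy as above:

```bash
# Download into your $data area, not your home area.
cd /data/$USER

# Files are placed in a subdirectory named after the dataset.
rucio download mc16_13TeV:my.example.dataset_v1
```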