Data transfer

To bring data from your laptop or another host to the cluster, you will mainly use the scp (there are equivalents for Windows) or rsync commands.

You will need to give extra command-line parameters to ensure that the data transfer program you use respects the sticky bit (strictly speaking, the setgid bit: the s in a directory's group permissions) and does not cause quota issues.

Using scp

scp is a secure file copy program. It connects to a remote server over SSH and transfers the specified files via that connection.

Danger

When files are transferred with scp, the sticky bit on the destination directory is not inherited by the transferred directories.

  • This is not a problem if you are copying files to /pub/ucinetid.

  • This is a problem when copying to a /dfsX/group-lab-path area, where it usually results in quota exceeded errors.

There are 2 ways to deal with this.

Scenario 1

Transfer the needed files with scp (using the recursive option -r for directories), then fix the permissions afterwards. For example, a user has access to the group allocation /dfsX/panteater_lab/panteater and wants to transfer data there.

On your laptop or another server, run the scp command:

$ scp -r mydata panteater@hpc3.rcic.uci.edu:/dfsX/panteater_lab/panteater

On HPC3 check the permissions on the transferred directory:

$ ls -l /dfsX/panteater_lab/panteater
total 138
drwxr-xr-x 6 panteater panteater_lab     18 Feb 18 13:10 mydata

Note that the permissions drwxr-xr-x are missing s (the sticky bit is not set), which means all subdirectories under mydata are missing it as well. You will need to fix the permissions on mydata:

$ chmod g+s /dfsX/panteater_lab/panteater/mydata

Similarly, repeat chmod on all subdirectories under it.
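
The repeated chmod described above can be done in one pass with find. The following is a minimal sketch, assuming a standard find is available on the cluster; the function name fix_group_bit is hypothetical and not part of any cluster tooling.

```shell
# Hypothetical helper: set the group s bit (chmod g+s) on a directory
# and on every directory beneath it. Regular files are left untouched.
fix_group_bit() {
    # -type d restricts the change to directories only
    find "$1" -type d -exec chmod g+s {} +
}

# Example call on HPC3 (path taken from the example above):
# fix_group_bit /dfsX/panteater_lab/panteater/mydata
```

Afterwards, ls -l on the directory and a few of its subdirectories should show s in the group permissions.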

Scenario 2

This approach requires less work and is less error-prone.

On your laptop (or remote server) create a compressed tar file of the files you want to transfer and then scp this compressed file:

$ tar czvf mydata.tar.gz mydata
$ scp mydata.tar.gz panteater@hpc3.rcic.uci.edu:/dfsX/panteater_lab/panteater

On the cluster, extract the transferred file and check the permissions:

$ cd /dfsX/panteater_lab/panteater
$ tar xzf mydata.tar.gz
$ ls -l
total 138
drwxr-sr-x 6 panteater panteater_lab     18 Feb 18 13:12 mydata

$ ls -l mydata
total 124
-rw-r--r--  1 panteater panteater_lab 17075 Jul 21  2020 desc.cvs
-rwxr-xr-x  1 panteater panteater_lab  7542 Jul 21  2020 README
drwxr-sr-x  2 panteater panteater_lab     4 Feb 18 12:03 common
drwxr-sr-x  2 panteater panteater_lab     3 Feb 18 12:03 images

Note that the permissions drwxr-sr-x on mydata include s, and all directories under mydata inherited it. After verifying, delete the transferred mydata.tar.gz.
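
One possible way to do this verification is to list any directory that lacks the s bit; an empty result suggests that every directory inherited it. This is only a sketch, and the function name list_dirs_missing_group_bit is hypothetical.

```shell
# Hypothetical helper: print every directory under $1 (including $1 itself)
# whose group s bit (octal 2000) is NOT set.
list_dirs_missing_group_bit() {
    find "$1" -type d ! -perm -2000
}

# Example on HPC3 (paths taken from the example above):
# cd /dfsX/panteater_lab/panteater
# list_dirs_missing_group_bit mydata   # no output: all directories have s
# rm mydata.tar.gz                     # remove the archive once satisfied
```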

Using rsync

rsync is a program that can greatly speed up file transfers, largely because it skips files that are already up to date at the destination. See man rsync for more information and available options.

Two rsync options overwrite the destination permissions, which is a common problem users encounter when transferring data:

  • -p, --perms preserve permissions

  • -a, --archive archive mode; same as -rlptgoD, implies -p

Important

When the -p option is used, rsync preserves the permissions of the source. This is incorrect for files and directories at the destination, which need to comply with the destination's user:group permissions.

Avoid using -p and -a options when running rsync commands.

For example, to recursively copy a local directory and show verbose output, use:

$ rsync -rv mydata panteater@hpc3.rcic.uci.edu:/dfsX/panteater_lab/panteater
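
If symbolic links and modification times should also be kept, one option is to add -l and -t individually instead of using -a. The sketch below is a suggestion rather than site policy; the wrapper name rsync_no_perms is hypothetical, while -r, -l, -t, -v and --dry-run are standard rsync options.

```shell
# Hypothetical wrapper: recursive copy (-r) that keeps symlinks (-l) and
# modification times (-t) with verbose output (-v). It deliberately omits
# -p, -g, -o and -a so the destination permissions and group are not overwritten.
rsync_no_perms() {
    src=$1
    dest=$2
    rsync -rltv "$src" "$dest"
}

# Preview first with --dry-run (nothing is copied), then run the transfer:
# rsync -rltv --dry-run mydata panteater@hpc3.rcic.uci.edu:/dfsX/panteater_lab/panteater
# rsync_no_perms mydata panteater@hpc3.rcic.uci.edu:/dfsX/panteater_lab/panteater
```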

Using Aspera

Aspera is not installed cluster-wide because the Aspera client must be installed by each user in a user-writable area.

  1. Download

    Download the Aspera Connect software from https://www.ibm.com/aspera/connect/. Copy the URL for Linux from the download page and paste it into a wget command:

    $ wget https://d3gcli72yxqn2z.cloudfront.net/downloads/connect/latest/bin/ibm-aspera-connect_4.2.8.540_linux_x86_64.tar.gz
    

    With the command above, the file is saved as ibm-aspera-connect_4.2.8.540_linux_x86_64.tar.gz. Note that the version in this example download is 4.2.8.540; it will differ when a new version becomes available.

  2. Install

    Use the correct version number from your download in the following commands:

    $ tar -zxvf ibm-aspera-connect_VERSION_linux_x86_64.tar.gz
    $ ./ibm-aspera-connect_VERSION_linux_x86_64.sh
    

    This creates the $HOME/.aspera/connect directory, which contains all the components the Aspera Connect client needs, such as the compiled binary and certificates.

  3. Use

    Sites that require the Aspera client for upload or download usually provide specific instructions on how to connect to their Aspera servers.

    The following example shows a download of a fastq file from a remote server to a local directory dir1 (the command is broken into multiple lines with \ for readability):

    $ $HOME/.aspera/connect/bin/ascp  \
       -v \
       -P33001 \
       -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh \
       era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR179/003/SRR1798143/SRR1798143.fastq.gz dir1/
    
    • -v enables verbose mode.

    • -P33001 sets the initial TCP connection port. Your server may require a different port. Our network settings allow such high-numbered ports to be opened for the transfer.

    • -i specifies the private key file created during the install.

    Any other flags will depend on the Aspera server setup. For additional help on usage:

    $ $HOME/.aspera/connect/bin/ascp -h
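
Before starting a transfer, it can help to confirm that the install step produced the ascp binary. This is a minimal sketch, assuming the default install location $HOME/.aspera/connect described above; the function name check_ascp is hypothetical.

```shell
# Hypothetical check: report whether the ascp binary from the user-level
# Aspera Connect install exists and is executable.
check_ascp() {
    ascp_bin="$HOME/.aspera/connect/bin/ascp"
    if [ -x "$ascp_bin" ]; then
        echo "found: $ascp_bin"
    else
        echo "ascp not found under $HOME/.aspera/connect" >&2
        return 1
    fi
}

# Example:
# check_ascp && "$HOME/.aspera/connect/bin/ascp" -h
```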