Part 9. Executing the data pipeline

65.

You are now ready to submit the script to the slurm queue:

(env1) [ec2-user@ip-10-0-5-43 20181105] $ sbatch /data/src/PyHipp/pipeline-slurm.sh

66.

You can use the squeue function to watch the progress of the job:

(env1) [ec2-user@ip-10-0-5-43 20181105] $ squeue

JOBID   PARTITION    NAME       USER    ST    TIME   NODES  NODELIST(REASON)
    2      queue1    pipe   ec2-user    PD    0:00       1        (Priority)

The last column will say (Priority) when the job is first queued, and switch to (Resources) when the compute node is starting up. When the job starts running on the compute node, the last column will now state the address of the compute node, and the ST (status) column will change from PD (pending), to CF (configuring), to R (running):

JOBID  PARTITION     NAME      USER   ST     TIME  NODES  NODELIST(REASON)
    2     queue1     pipe  ec2-user    R     0:00      1  ip-10-0-1-245

You can also use the following function to track the progress of the job by checking the slurm output file (the number in the name of the .out file corresponds to your Job ID above, so you may have to modify the command below to match your Job ID):

(env1) [ec2-user@ip-10-0-5-43 20181105] $ tail -f pipe-slurm.*.2.out

The tail function will show you the last 10 lines of the file or files you specify, and adding the -f argument will cause it to continue to monitor the files, and print out new lines as they are added to the files. This allows you to monitor the progress of the jobs by looking at their outputs. You can type Ctrl-C at any point to get out of the tail function.

67.

If you make a mistake, you can cancel jobs by specifying the job number:

(env1) [ec2-user@ip-10-0-5-43 20181105] $ scancel 2

The command above will cancel job #2. You can cancel a range of jobs by doing:

(env1) [ec2-user@ip-10-0-5-43 20181105] $ scancel {2..7}

or all jobs by doing:

(env1) [ec2-user@ip-10-0-5-43 20181105] $ scancel --user=ec2-user

68.

If it is taking a long time (more than 10 mins) for the jobs to start running, it could be because there are no servers available with the compute nodes you specified. In this case, you can switch to one of the other instance types for your compute nodes that has more than 64 GB of memory, but do take note of the price differences, which will consume your AWS credits at a faster rate.

If you do not have any jobs running, you can use the following command on your computer to change the instance type of your compute nodes after editing the config file without having to delete and re-create the cluster:

(aws) $ pcluster update-compute-fleet --status STOP_REQUESTED -n MyCluster01

(aws) $ pcluster update-cluster -c ~/cluster-config.yaml -n MyCluster01

You can find more information about pcluster update from:

69.

It will take some time for the job to finish, so you can wait till you receive the email notification that your job has been completed to continue.