OzStar Notes

Interactive Jobs

If you want to test software that requires a GUI, MPI/parallel execution, or multiple threads, interactive jobs may be useful. Note that if you need a GUI, you'll need to ssh with -X.
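For example, connecting with X11 forwarding (username and host as used elsewhere in these notes; swap in your own):

# -X enables X11 forwarding so GUI windows opened on the cluster display locally
ssh -X avajpeyi@ozstar.swin.edu.au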

Run the following to start an interactive job:

sinteractive --ntasks 1 --nodes 1 --time 00:30:00 --mem 2GB

Once resources are allocated, you'll be placed on a separate node for your interactive session. You will need to re-load your modules.
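A quick sanity check once the allocation comes through (the module list is just the example set used in the next section):

# confirm you are now on the allocated node rather than the login node
hostname
# modules from the login session are not carried over, so re-load whatever you need
module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5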

Jupyter notebooks + Slurm

In an interactive job session you can open a jupyter notebook with the following steps:

  1. Source the environment for your interactive session. For example, you may run the following:

    source ~/.bash_profile
    module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5
    source venv/bin/activate 
    
  2. Set up the tunnel + jupyter instance on the cluster

    To do this, run the following:

    ipnport=$(shuf -i8000-9999 -n1)
    ipnip=$(hostname -i)
    echo "Run on local >>> ssh -N -L $ipnport:$ipnip:$ipnport avajpeyi@ozstar.swin.edu.au"
    jupyter-notebook --no-browser --port=$ipnport --ip=$ipnip
    
  3. Local connection to interactive job

    • run the command echoed above
    • open the link to the jupyter notebook (printed in the previous window)
  4. Run exit when done

    Otherwise the job will keep running, hogging resources, until it times out (see the example after this list for checking and cancelling it).
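If you forget to exit, you can check for and cancel the lingering interactive job from the login node (standard Slurm commands; substitute the job ID reported by squeue):

# list your running jobs and note the interactive job's ID
squeue -u $USER
# cancel it by ID
scancel <JOBID>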

For convenience, I have added the following to my OzStar .bash_profile

# Interactive Jupyter notebooks
alias start_ijob="sinteractive --ntasks 2 --time 01:00:00 --mem 4GB"
start_jupyter () {
    ipnport=$(shuf -i8000-9999 -n1)
    ipnip=$(hostname -i)
    echo "Run on local >>>"
    echo "ssh -N -L $ipnport:$ipnip:$ipnport avajpeyi@ozstar.swin.edu.au"
    jupyter-notebook --no-browser --port=$ipnport --ip=$ipnip
}
export -f start_jupyter

This allows me to start an interactive job with start_ijob and start the Jupyter notebook with start_jupyter.
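A typical session with these helpers looks something like the following (the module and venv names are just the examples from earlier):

start_ijob                                                    # request the interactive allocation
module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5   # re-load modules on the new node
source venv/bin/activate
start_jupyter                                                 # echoes the ssh tunnel command, then launches jupyter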

Plot CPU hours used for jobs

Academics should try to be cognizant of the energy impact of their jobs. The following creates a file jobstats.txt that contains the CPU time (in seconds) for each job run between the specified start and end times.

sacct -S 2021-01-01 -E 2021-10-06 -u avajpeyi -X -o "jobname%-40,cputimeraw,start" --parsable2 > jobstats.txt 
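If you only want the headline number (no plots), something like this awk one-liner should work on the pipe-separated jobstats.txt produced above:

# skip the header row, sum CPUTimeRAW (seconds), convert to hours
awk -F'|' 'NR > 1 {total += $2} END {printf "Total: %.1f CPU hrs\n", total / 3600}' jobstats.txt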

To plot the data you can use the following:

""" Plots total number of CPU hours used
To create a "jobstats.txt" run somthing like the following:
> sacct -S 2020-01-01 -E 2021-10-06 -u avajpeyi -X -o "jobname%-40,cputimeraw,start" --parsable2 > jobstats.txt
"""
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from humanize import intword
from matplotlib import ticker
plt.style.use(
"https://gist.githubusercontent.com/avivajpeyi/4d9839b1ceb7d3651cbb469bc6b0d69b/raw/4ee4a870126653d542572372ff3eee4e89abcab0/publication.mplstyle")
plt.rcParams['axes.grid']= False
FNAME = "jobstats.txt"
SEC_IN_HR = 60.0 * 60.0
def read_file():
with open(FNAME, 'r') as f:
filecontents = f.read().split("\n")
header = filecontents[0].split("|")
data = filecontents[1:]
data = [d for d in data if len(d) > 1]
data = np.array([np.array(row.split("|")) for row in data])
data = data.T
data_dict = {header[i]: data[i] for i in range(len(header))}
data = pd.DataFrame(data_dict)
data['CPUTimeRAW'] = data['CPUTimeRAW'].astype('float64')
data['CPU Hrs'] = data['CPUTimeRAW'] / SEC_IN_HR
data['Start'] = pd.to_datetime(d['Start'], format='%Y-%m-%dT%H:%M:%S')
return data
def get_total_cpu_hrs(data):
return np.sum(data['CPUTimeRAW'].values) / SEC_IN_HR
def plot_time(data):
plt.figure(figsize=(4, 3))
data['CPU Hrs'] = data['CPUTimeRAW'] / SEC_IN_HR
hrs = data['CPU Hrs']
hrs = hrs[hrs > 0.01]
min_h, max_h = min(hrs), max(hrs)
plt.hist(hrs, density=False, bins=np.geomspace(min_h, max_h, 100))
plt.xlabel("CPU Hrs")
plt.xlim(left=min_h)
plt.yscale('log')
plt.xscale('log')
plt.ylabel("Jobs")
plt.title(f"Total: {intword(get_total_cpu_hrs(data), '%.1f')} Hr")
plt.tight_layout()
plt.savefig('cpuhrs_hist.png')
def bin_dates_data(df, delta=5):
delta = timedelta(days=delta)
res = {}
# end of first bin:
binstart = df['Start'][0]
bin_key = str(binstart)
res[bin_key] = 0
# iterate through the data item
for i, row in df.iterrows():
cur_date, cur_data = row['Start'], row['CPU Hrs']
# if the data item belongs to this bin, append it into the bin
if cur_date < binstart + delta:
res[bin_key] = res.get(bin_key,0) + cur_data
continue
# otherwise, create new empty bins until this data fits into a bin
binstart += delta
bin_key = str(binstart)
while cur_date > binstart + delta:
res[bin_key] = 0
binstart += delta
bin_key = str(binstart)
# create a bin with the data
res[bin_key] = res.get(bin_key,0) + cur_data
date_bins, cpu_hrs = list(res.keys()), list(res.values())
return date_bins, cpu_hrs
def format_date_ticklabel(d):
d = datetime.strptime(d, "%Y-%m-%d %H:%M:%S")
return d.strftime("%b, '%y")
def plot_cpu_timseries(data, delta=20):
date_bins, cpu_hrs = bin_dates_data(data, delta=delta)
fig, ax = plt.subplots(1,1, figsize=(4,2.5))
ax.bar(date_bins, cpu_hrs,width=1)
plt.xticks(rotation=-45)
num_bins = len(date_bins)
ticks = [i for i in range(0, num_bins, int(num_bins/5))]
labels = [format_date_ticklabel(date_bins[i]) for i in ticks]
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
ax.set_yscale('log')
plt.minorticks_off()
ax.set_ylim(bottom=0.3)
ax.set_ylabel("Hrs")
plt.grid(visible=False)
plt.savefig('cpuhrs_timeseries.png')
def main():
data = read_file()
total_hrs = get_total_cpu_hrs(data)
print(f"Total CPU hrs: {total_hrs:.2f}")
plot_time(data)
plot_cpu_timseries(d)
if __name__ == '__main__':
main()
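Saved as, say, plot_jobstats.py (the filename is arbitrary), it can be run once jobstats.txt exists:

# requires matplotlib, numpy, pandas and humanize in the active environment
pip install matplotlib numpy pandas humanize
python plot_jobstats.py   # prints the total and writes cpuhrs_hist.png + cpuhrs_timeseries.png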

[Figure: Total CPU hours I've used ('19-'22)]

Downloading/Uploading data

Slurm job with data-download

The nodes with the fastest network speeds are the data-mover nodes. The compute nodes don't have an internet connection, so any jobs that require data download should be done as a pre-processing step on the data-mover nodes.

For example:

#!/bin/bash
#
#SBATCH --job-name={{jobname}}
#SBATCH --output={{log_dir}}/download_%A_%a.log
#SBATCH --ntasks=1
#SBATCH --time={{time}}
#SBATCH --mem={{mem}}
#SBATCH --cpus-per-task={{cpu_per_task}}
#SBATCH --partition=datamover
#SBATCH --array=0-4

module load {{module_loads}}
source {{python_env}}

ARRAY_ARGS=(0 1 2 3 4)

srun download_dataset ${ARRAY_ARGS[$SLURM_ARRAY_TASK_ID]} 

Rsync data from/to OzStar

rsync -avPxH --no-g --chmod=Dg+s <LOCAL_PATH> avajpeyi@data-mover01.hpc.swin.edu.au:/fred/<OZ_PROJ>
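The same pattern works in reverse for pulling results back to your machine (the remote path here is just a placeholder):

# copy a results directory from OzStar into the current local directory
rsync -avPxH --no-g avajpeyi@data-mover01.hpc.swin.edu.au:/fred/<OZ_PROJ>/<RESULTS_DIR> .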

Sequential jobs

Say you want to trigger sequential jobs (like a DAG). You will need to use the job ID for this, together with --dependency=afterany:<JOBID>. For example:

==> submit.sh <==
#!/bin/bash


ANALYSIS_FN=('slurm_analysis_0.sh' 'slurm_analysis_1.sh')
POST_FN='slurm_post.sh'
JOB_IDS=()

for index in ${!ANALYSIS_FN[*]}; do
  echo "Submitting ${ANALYSIS_FN[$index]}"
  JOB_ID=$(sbatch --parsable ${ANALYSIS_FN[$index]})
  JOB_IDS+=($JOB_ID)
done


IDS="${JOB_IDS[@]}"
IDFORMATTED=${IDS// /:}


echo "Submitting ${POST_FN}"
echo "sbatch --dependnecy=afterany:${IDFORMATTED} ${POST_FN}"

sbatch --dependency=afterany:$IDFORMATTED $POST_FN

squeue -u $USER -o '%.4u %.20j %.10A %.4C %.10E %R'
The submission scripts are shown below:

==> slurm_analysis_0.sh <==
#!/bin/bash
#
#SBATCH --job-name=analysis_0
#SBATCH --output=out.log
#
#SBATCH --ntasks=1
#SBATCH --time=0:01:00
#SBATCH --mem=100MB
#SBATCH --cpus-per-task=1

module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5
echo "analysis 0"
==> slurm_analysis_1.sh <==
#!/bin/bash
#
#SBATCH --job-name=analysis_1
#SBATCH --output=out.log
#
#SBATCH --ntasks=1
#SBATCH --time=0:01:00
#SBATCH --mem=100MB
#SBATCH --cpus-per-task=1

module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5
echo "analysis 1"
==> slurm_post.sh <==
#!/bin/bash
#
#SBATCH --job-name=post
#SBATCH --output=out.log
#
#SBATCH --ntasks=1
#SBATCH --time=0:01:00
#SBATCH --mem=100MB
#SBATCH --cpus-per-task=1

module load git/2.18.0 gcc/9.2.0 openmpi/4.0.2 python/3.8.5
echo "post"
