Do you really need pBilby for your analysis?
For most analyses, bilby_pipe may be sufficient (and better suited!):
- bilby_pipe can easily download proprietary LVK data.
- bilby_pipe is well configured to run many jobs in parallel.
- Cluster wait times for a pBilby job can be longer than the time it takes to run bilby_pipe.
pBilby should only be used if you want to run an expensive job (e.g. lots of live points, or an expensive waveform model).
Setting up pBilby on OzStar
Making a venv for analysis
Before installing pBilby, let's set up some virtual Python environments.
<ssh into ozstar>
module --force purge
module load git/2.18.0 git-lfs/2.4.0 gcc/9.2.0 openmpi/4.0.2 numpy/1.19.2-python-3.8.5 mpi4py/3.0.3-python-3.8.5 && module unload zlib
python -m venv pbilby_venv
source pbilby_venv/bin/activate
Choosing a dir for your venv
Your venv will be faster to boot up if you make it in your /home/ directory, rather than in /fred/ (as the latter is on a network drive). However, /home/ has a very small amount of storage space.
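For example (the ~/venvs path below is just an illustration; put it wherever suits you):
mkdir -p ~/venvs
python -m venv ~/venvs/pbilby_venv
source ~/venvs/pbilby_venv/bin/activate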
You will need to load the same modules every time you want to use the pbilby_venv environment. I recommend adding the following to your .bashrc file:
alias ligo_loads='module --force purge && module load git/2.18.0 git-lfs/2.4.0 gcc/9.2.0 openmpi/4.0.2 numpy/1.19.2-python-3.8.5 mpi4py/3.0.3-python-3.8.5 && module unload zlib && source /fred/oz980/avajpeyi/envs/pbilby_venv/bin/activate'
Then, anytime you ssh onto OzStar, you can just type ligo_loads to load the modules and activate the pbilby_venv environment.
Alternative partitions
If you want to use partitions other than skylake (e.g. sstar/gstar), you'll need to ssh sstar / ssh gstar and make a new venv for each partition you want to run on.
<ssh into ozstar>
ssh sstar
<same steps as above>
A new venv is needed as each partition has a custom architecture and can't use builds from other architectures.
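If you're curious, you can compare the CPUs on the different login nodes (a quick check; the output will differ between skylake and sstar/gstar):
lscpu | grep 'Model name'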
Why not use conda?
conda has a large overhead compared to venv (e.g. conda takes 10s to load, while venv takes 1s). Importing parallel_bilby took several minutes on OzStar using conda!
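Once things are installed, you can time the activation and the import yourself if you're curious (rough numbers; they will vary):
time source pbilby_venv/bin/activate
time python -c 'import parallel_bilby'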
Installing pBilby
Now we can install pBilby (note: sstar/gstar don't have access to the internet, so you'll need to install pBilby while on farnarkle):
pip install parallel_bilby
If you’re not doing a vanilla analysis, I would suggest the following method to install pbilby:
git clone git@git.ligo.org:lscsoft/parallel_bilby.git
cd parallel_bilby
python setup.py develop
develop mode allows you to edit the source code and have the changes take effect immediately, which is handy if you want to make changes to pBilby.
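A quick way to confirm Python is picking up your editable checkout (the printed path should point into the cloned parallel_bilby directory, not site-packages):
python -c 'import parallel_bilby; print(parallel_bilby.__file__)'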
Local
In addition to OzStar, I would suggest installing pBilby on your local machine to help with debugging and making sure your analysis can actually start running.
Configuring your ini file
The pBilby ini is very similar to the bilby_pipe ini, but with a few extra options. Here is an example ini for GW150914 analysis:
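As a rough sketch (the values below are illustrative placeholders; the complete GW150914 ini is shipped with the parallel_bilby examples), the extra pBilby sampler/slurm options sit alongside the usual bilby_pipe ones:
label = GW150914
outdir = outdir_GW150914
detectors = [H1, L1]
trigger-time = 1126259462.4
duration = 4
data-dict = {H1=raw_data/h1_data.txt, L1=raw_data/l1_data.txt}
psd-dict = {H1=psd_data/h1_psd.txt, L1=psd_data/l1_psd.txt}
waveform-approximant = IMRPhenomPv2
prior-file = GW150914.prior
# pBilby-specific sampler settings
nlive = 1000
nact = 5
# pBilby-specific slurm settings
nodes = 10
ntasks-per-node = 16
time = 24:00:00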
You can remove the custom pBilby options and run the analysis with bilby_pipe to see if it works.
Note that you will need to manually get the data and PSD files for the analysis to work.
Data:
Here is a helper script to get the data:
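The original script isn't reproduced here, but a minimal sketch of the sort of thing it does, using gwpy (the GPS window, channel choice, and output paths are placeholders you would replace), is:
# sketch: fetch strain data around GW150914 and save it to disk
import os
from gwpy.timeseries import TimeSeries

trigger_time = 1126259462.4
start, end = trigger_time - 256, trigger_time + 4  # placeholder window

os.makedirs("raw_data", exist_ok=True)
for ifo in ["H1", "L1"]:
    # Open data works anywhere; for proprietary data on CIT you would use
    # TimeSeries.get("<ifo>:<channel>", start, end) after ligo-proxy-init.
    data = TimeSeries.fetch_open_data(ifo, start, end)
    data.write(f"raw_data/{ifo.lower()}_data.txt")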
The easiest way I find to get data is:
- On CIT: ligo-proxy-init avi.vajpeyi && kinit
- Run the above py script
- scp the data from CIT to OzStar (an example command is below)
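For example, pushing the files from CIT to OzStar (the username and paths below are placeholders):
scp raw_data/*.txt <ozstar-username>@ozstar.swin.edu.au:/fred/<project>/<user>/GW150914/raw_data/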
PSD:
You can get PSDs from the LSC PSD database.
Note that the PSDs have to be formatted in the same way as those for bilby_pipe jobs.
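That is, a plain-text file with frequency in the first column and the PSD in the second, something like (the numbers below are made-up placeholders):
# frequency [Hz]   PSD [strain^2/Hz]
20.000   1.2e-43
20.125   1.1e-43
...
The ini then points at these files via psd-dict.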
For GW150914, you can get away with downloading the PSDs from the pBilby examples.
Submitting jobs
Job setup step
Once you have your ini and data/PSD files, you can set up the job for OzStar using the following command:
parallel_bilby_generate <ini>
This should generate a folder called outdir (or whatever you specified in the ini file) with a bunch of files in it. E.g. this is what my dir looks like:
outdir_GW150914
├── data
│ ├── GW150914_data_dump.pickle
│ ├── GW150914_prior.json
│ ├── H1_full_frequency_domain_data.png
│ └── L1_full_frequency_domain_data.png
├── GW150914_config_complete.ini
├── log_data_analysis
├── log_data_generation
├── result
└── submit
├── analysis_GW150914_0.sh
└── bash_GW150914.sh
The submit folder contains the scripts that will be submitted to the slurm queue.
To test if the job will run, you can try running the analysis_GW150914_0.sh script locally. First, identify the execution command in the script. It should look something like this:
mpirun parallel_bilby_analysis <...data_dump.pickle> ....
To run this locally, copy the above line, and run it like so:
mpirun -n 2 parallel_bilby_analysis <...data_dump.pickle> ....
This asks mpi to run the parallel_bilby_analysis script with 2 cores on the head node. If this reaches the sampling stage, i.e. if you see something like:
#:10|eff(%):4.744|logl*:-inf<11.8<inf|logz:7.1+/-0.1|dlogz:302.0>0.1
then you know that the job is configured correctly and will run on OzStar! Woohoo!
Now you can submit it on OzStar.
Starting jobs immediately:
Before submitting your job on OzStar, run the following:
$ showbf
skylake
2 nodes (32 core) free (64 cores total) for 9:01:49 to Inf
1 slot for 28-core jobs free (28 cores total) for 23:59:59
1 slot for 26-core jobs free (26 cores total) for 41:00:46
2 slots for 20-core jobs free (40 cores total)
1 slot for 18-core jobs free (18 cores total)
1 slot for 17-core jobs free (17 cores total) (low memory jobs only)
3 slots for 16-core jobs free (48 cores total) for 9:01:49 to Inf
1 slot for 12-core jobs free (12 cores total)
1 slot for 10-core jobs free (10 cores total) for 11:30:53
2 slots for 8-core jobs free (16 cores total)
1 slot for 6-core jobs free (6 cores total)
1 slot for 2-core jobs free (2 cores total)
2 slots for 1-core jobs free (2 cores total) for 11:30:53 to 44:11:07
sstar
1 node (32 core) free (32 cores total)
47 nodes (16 core) free (752 cores total) for 48:14:50
1 slot for 14-core jobs free (14 cores total) for 48:14:50
gstar
knl
This shows you the current state of the OzStar queue. If you see a lot of free slots, you may be able to submit your job immediately!
(BTW, Conrad Chan made a nifty webtool with the same data.)
Notice that sstar has 47 nodes with 16 cores each, free for 48 hrs. This means that you can submit a job with 752 cores (if you request the runtime to be less than 48 hrs).
To do this, edit the analysis_GW150914_0.sh script and change the following:
#SBATCH --time=48:00:00
#SBATCH --nodes=47
#SBATCH --ntasks-per-node=16
Then submit the job using
bash outdir_GW150914/submit/bash_GW150914.sh
OR
sbatch outdir_GW150914/submit/bash_GW150914.sh
Starting jobs with lots of cores:
If you're unlucky and see that there really aren't that many cores free, you can:
- Submit a job with fewer cores, or
- Look at the OzStar queue to figure out which nodes will be free in the near future, then submit a job with more cores on those nodes.
#SBATCH --dependency=singleton is a useful flag to use when submitting jobs. It tells slurm to start a job only after any previously submitted jobs with the same job name (and user) have finished, so jobs sharing a name run one after another.
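For example, if you submit the same script twice under one job name, the second submission waits for the first to finish (the job name below is just an illustration; this is also handy for resubmitting a checkpointing pBilby job):
sbatch --job-name=GW150914 --dependency=singleton outdir_GW150914/submit/analysis_GW150914_0.sh
sbatch --job-name=GW150914 --dependency=singleton outdir_GW150914/submit/analysis_GW150914_0.sh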
Monitoring jobs
Checking the queue
To check the status of your job, run:
scontrol show job <job_id>
To check the status of all your jobs and display how long they have been running:
watch -n 1 squeue --me -o \'%.4C %.2t %.7M %j\'
Checking the job output
To check the output of your job, run:
tail -f outdir_*/log_data_*/analysis_*.log
Once the job completes, the dir will look something like this:
outdir_GW150914
├── data
│ ├── GW150914_data_dump.pickle
│ ├── GW150914_prior.json
│ ├── H1_full_frequency_domain_data.png
│ └── L1_full_frequency_domain_data.png
├── GW150914_config_complete.ini
├── log_data_analysis
│ └── 0_GW150914.log
├── log_data_generation
│ └── GW150914.log
├── result
│ ├── GW150914_0_checkpoint_resume.pickle
│ ├── GW150914_0_checkpoint_run.png
│ ├── GW150914_0_checkpoint_stats.png
│ ├── GW150914_0_checkpoint_trace.png
│ ├── GW150914_0_corner.png
│ ├── GW150914_0_result.json
│ └── GW150914_0_samples.dat
└── submit
├── analysis_GW150914_0.sh
└── bash_full.sh