I'd like to use the MPI checkpointing feature to save my job's state. According to https://wiki.mpich.org/mpich/index.php/Checkpointing,
I should be able to send SIGUSR1 to mpiexec (in my case, I send it to mpirun) to trigger a checkpoint. However, when I do so, no file is saved in the checkpoint directory I specified with -ckpoint-prefix.
Here is my mpirun -info output:
HYDRA build details:
Version: 4.1 Update 1
Release Date: 20130522
Process Manager: pmi
Bootstrap servers available: ssh rsh fork slurm srun ll llspawn.stdio lsf blaunch sge qrsh persist jmi
Resource management kernels available: slurm srun ll llspawn.stdio lsf blaunch sge qrsh pbs
Checkpointing libraries available: blcr
Demux engines available: poll select
My command line is:
mpirun -ckpointlib blcr -ckpoint-prefix /home/user/temp/ckpoint -ckpoint-interval 1800 -np 274 $PROGPATH/myapp
I send the signal like this:
kill -s USR1 1900
where 1900 is the PID of mpirun. Whenever I send the signal, the program simply ends, without any crash. Does anybody have experience with MPI checkpointing?
I think I figured it out. I was sending USR1 to mpirun, but I should send it to mpiexec.hydra instead, even though some online articles say mpirun and mpiexec are the same thing.
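For anyone who lands here later, this is a minimal sketch of how the signal could be directed at mpiexec.hydra instead of the mpirun process. The function name signal_checkpoint is made up for illustration. It assumes pgrep is installed and that the launcher's process name is exactly mpiexec.hydra, so check with ps first if your setup differs.

```shell
#!/bin/sh
# Hypothetical helper (not part of any MPI distribution): send SIGUSR1 to
# every process owned by the current user whose name matches exactly.
signal_checkpoint() {
    # Assumption: the launcher shows up as "mpiexec.hydra" in the process list.
    name="${1:-mpiexec.hydra}"

    # pgrep -x requires an exact name match; -u restricts to our own processes.
    pids=$(pgrep -x -u "$(id -u)" "$name") || {
        echo "no running process named $name" >&2
        return 1
    }

    for pid in $pids; do
        echo "sending SIGUSR1 to $name (pid $pid)"
        kill -s USR1 "$pid"
    done
}
```

After sourcing this, running signal_checkpoint with no argument targets mpiexec.hydra. Passing another name, such as signal_checkpoint mpirun, would reproduce what I was doing originally.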