
MPI checkpoint usage

Question:

I'd like to use the MPI checkpoint feature to save the state of my job. According to https://wiki.mpich.org/mpich/index.php/Checkpointing, I should be able to send SIGUSR1 to mpiexec (in my case, I send it to mpirun) to trigger a checkpoint. However, when I do so, no file appears in the checkpoint directory that I specified with -ckpoint-prefix.

Here is my mpirun -info output:

HYDRA build details:

Version: 4.1 Update 1

Release Date: 20130522

Process Manager: pmi

Bootstrap servers available: ssh rsh fork slurm srun ll llspawn.stdio lsf blaunch sge qrsh persist jmi

Resource management kernels available: slurm srun ll llspawn.stdio lsf blaunch sge qrsh pbs

Checkpointing libraries available: blcr

Demux engines available: poll select

My command line is:

mpirun -ckpointlib blcr -ckpoint-prefix /home/user/temp/ckpoint -ckpoint-interval 1800 -np 274 $PROGPATH/myapp

I send the signal with kill -s USR1 1900, where 1900 is the PID of mpirun. Whenever I send the signal, the program simply ends, though without a crash. Does anybody have experience with MPI checkpointing?

Answer:

I think I figured it out. I was sending USR1 to mpirun, but I should send it to mpiexec.hydra instead, even though some online articles say that mpirun and mpiexec are the same thing.
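For anyone hitting the same problem, here is a minimal shell sketch of that fix. It assumes the Hydra launcher shows up in the process table under the name mpiexec.hydra, as in my case, and that pgrep is available. The helper names find_launcher_pids and request_checkpoint are made up for illustration; they are not part of any MPI distribution.

```shell
#!/bin/sh
# Sketch only: deliver SIGUSR1 to the Hydra launcher rather than to mpirun.
# Assumption: the launcher process is named mpiexec.hydra.

# Print the PIDs of the current user's processes whose name is exactly $1.
find_launcher_pids() {
    pgrep -u "$(id -u)" -x "$1"
}

# Send SIGUSR1 to every PID passed as an argument.
request_checkpoint() {
    for pid in "$@"; do
        kill -s USR1 "$pid"
    done
}
```

To use it, run pids=$(find_launcher_pids mpiexec.hydra), check that $pids is not empty, and then call request_checkpoint $pids. If several MPI jobs run under the same user, the lookup returns every matching launcher, so pick the right PID by hand in that case.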
