If you have problems during the execution of MRCC, please attach the output with an adequate description of your case as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- compiler version (for example: ifort -V, gfortran -v)
- BLAS/LAPACK versions
- gcc and glibc versions
This information really helps us during troubleshooting.
MRCC MPI/OpenMP and SLURM
3 years 11 months ago #1031
by diefenbach
MRCC MPI/OpenMP and SLURM was created by diefenbach
Dear all,
I am trying to run the current MRCC binary (2020-02-22) with hybrid MPI/OpenMP parallelism using the SLURM batch queueing system.
Interactively (without batch queueing), MRCC runs fine as expected. E.g., with "mpitasks=2" and "OMP_NUM_THREADS=40" I obtain 2 MPI tasks via hydra_pmi_proxy, each spawning 40 threads, running successfully to completion using scf_mpi and mrcc_mpi.
With SLURM batch queueing, however, this appears to conflict with SLURM's "srun" command: dmrcc seems to launch instances of "srun" which are calling Intel MPI (hydra_pmi_proxy), and then hangs at the scf_mpi process.
Has anyone else encountered this issue, and perhaps come up with a solution?
Cheers,
Martin
Below is the SLURM script, which runs fine interactively via
Code:
> salloc --overcommit --nodes=1 --ntasks=2 --cpus-per-task=40
salloc: Node node45-008 ready for job
> ssh node45-008
> ./mrcc.sh
but hangs when submitted via "sbatch mrcc.sh":
mrcc.sh:
Code:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=40
export OMP_NUM_THREADS=40
export MKL_NUM_THREADS=40
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt
export PATH=${MRCC_DIR}:${PATH}
cp ${HOME}/mrcc.inp MINP
dmrcc &> ${HOME}/mrcc.out
exit
The stalled/hanging processes on the scheduled node:
Code:
> ssh node45-001
> ps xl
PID PPID STAT TTY TIME COMMAND
132202 132197 S ? 0:00 /bin/bash /var/spool/slurm/d/job480438/slurm_script
132225 132202 S ? 0:00 /bin/sh /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpirun -np 1 dmrcc_mpi
132231 132225 S ? 0:00 mpiexec.hydra -np 1 dmrcc_mpi
132232 132231 Ssl ? 0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 45452 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132233 132232 S ? 0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 45452 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132246 132240 S ? 0:00 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132249 132246 S ? 0:00 dmrcc_mpi
132516 132231 Ss ? 0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 46735 --pgid 1 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
- MRCC developer
3 years 11 months ago #1032
by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,
Sorry for the problem; we have not encountered this issue before.
Your scripts look fine.
Could you share more information? E.g., on some clusters there were cluster-specific MPI problems. Can you share some documentation on the cluster setup?
You may also try
unset I_MPI_PMI_LIBRARY
to use Intel's internal PMI library instead of SLURM's.
You may also try to start dmrcc_mpi explicitly with mpirun, and perhaps also change the bootstrap server:
mpirun -np 1 -bootstrap $bootstrap dmrcc_mpi
with ssh, rsh, etc. as possible options for $bootstrap.
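For concreteness, a minimal sketch of how both suggestions could look inside your mrcc.sh (the ssh bootstrap is only one possible choice, and the output path is taken from your script):
Code:
# in mrcc.sh, after sourcing mklvars.sh and mpivars.sh
unset I_MPI_PMI_LIBRARY                       # fall back to Intel MPI's internal PMI instead of SLURM's
# launch the MPI driver explicitly instead of calling dmrcc
mpirun -np 1 -bootstrap ssh dmrcc_mpi &> ${HOME}/mrcc.out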
Any other output files and error messages would also be helpful, if you could share those.
Is this problem MRCC specific? Can you run other programs with similar setup?
Can you run MRCC via sbatch without MPI, i.e., not relying on MPI at all?
Do the test jobs run correctly?
Is login with ssh/rsh allowed between nodes? Our code needs to spawn the scf_mpi and mrcc_mpi processes via an MPI_Comm_spawn call in dmrcc_mpi, and this has to be allowed.
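As a quick check of that last point, one can test from inside a SLURM allocation whether non-interactive ssh between the allocated nodes works; the node name below is just a placeholder for one of the nodes reported by srun, and the salloc options are only illustrative:
Code:
> salloc --nodes=2 --ntasks=2
> srun hostname                                    # lists the allocated nodes
> ssh -o BatchMode=yes <allocated-node> hostname   # should print the hostname without asking for a password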
I hope some of this helps.
Best wishes,
Peter
3 years 11 months ago #1034
by diefenbach
Replied by diefenbach on topic MRCC MPI/OpenMP and SLURM
Dear Peter,
Many thanks for the reply!
Your suggestion to change the bootstrap server actually does the trick!
If I use dmrcc_mpi with mpirun instead of dmrcc, i.e.
Code:
# dmrcc &> mrcc.out
mpirun -np 1 -bootstrap ssh dmrcc_mpi &> mrcc.out
the job finishes to completion, running 2 tasks with 40 threads each.
There is, however, an error message at the very end concerning fort.17, which only appears in combination with MPI (also without SLURM):
Code:
Total CCSDT[Q] energy [au]: -55.792496754758
Total CCSDT(Q)/A energy [au]: -55.792638783431
Total CCSDT(Q)/B energy [au]: -55.792641166686
Fatal error in cp fort.17 .. 2> /dev/null.
Program will stop.
************************ 2020-12-16 17:21:20 ************************
Error at the termination of mrcc.
*********************************************************************
************************ 2020-12-16 17:21:22 ************************
Normal termination of mrcc.
*********************************************************************
Is this something to worry about?
I am running the following input:
Code:
basis=cc-pVTZ
calc=CCSDT(Q)
scftype=ROHF
mult=2
cctol=12
mem=8GB
mpitasks=2
unit=bohr
geom
H
N 1 R1
H 2 R1 1 A
R1=2.00000000000
A=104.2458898548
Just as a side note: the original problem with SLURM was specific to MRCC in combination with MPI. Jobs run without any issues when asking only for OpenMP threading within SLURM (i.e., jobs without "mpitasks=..." in the input file). Other programs, e.g. Molpro with hybrid MPI/OpenMP parallel jobs, also run regularly with sbatch/SLURM.
Best wishes,
Martin
- MRCC developer
3 years 11 months ago #1036
by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,
I am glad that it worked out for you.
You can ignore the second issue with the fort.17 copy, the energies are fine.
Thank you for pointing it out, this will be fixed in the next release.
If you wish, you can replace in mrcc.f the lines
if(master_thread) then
lll=.false.
inquire(file='CCDENSITIES',exist=lll)
if (lll) call ishell('mv CCDENSITIES .. 2> /dev/null')
call ishell('cp fort.16 .. 2> /dev/null')
call ishell('cp fort.17 .. 2> /dev/null')
call ishell('cp fort.63 .. 2> /dev/null')
end if
by these
if(master_thread) then
lll=.false.
inquire(file='CCDENSITIES',exist=lll)
if (lll) call ishell('mv CCDENSITIES .. 2> /dev/null')
call ishell('cp fort.16 .. 2> /dev/null')
inquire(file='fort.17',exist=lll)
if (lll) call ishell('cp fort.17 .. 2> /dev/null')
inquire(file='fort.63',exist=lll)
if (lll) call ishell('cp fort.63 .. 2> /dev/null')
end if
and recompile.
Best wishes,
Peter
3 years 11 months ago #1037
by diefenbach
Replied by diefenbach on topic MRCC MPI/OpenMP and SLURM
Dear Peter,
Thanks again for your support! If the copy message can be ignored, I might as well do so for now and wait for the next binary release.
Concerning the original issue with MPI and SLURM: apparently the culprit is the pre-installed 2019.5 version of Intel MPI on our cluster. After installing the 2019.3 version recommended in the manual, dmrcc works just as intended:
Code:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=40
export OMP_NUM_THREADS=40
export MKL_NUM_THREADS=40
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
export PATH=${MRCC_DIR}:${PATH}
cp ${HOME}/mrcc.inp MINP
dmrcc &> ${HOME}/mrcc.out
exit
I actually did not expect such an impact from a minor version release, and I am not sure about the origin of the issue, but there might have been a change in the settings for the Hydra process manager in Intel MPI 2019.5...
Anyway, everything works fine now on Intel-based machines! There is, however, another issue with AMD Epyc (Zen 2) based architectures; I shall open a new thread for that.
Best regards,
Martin
- MRCC developer
3 years 11 months ago #1038
by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,
I am glad that the Intel side works well now.
We did not have any experience with the 2019.5 Intel MPI version; it is good to know about this.
Please do open a new thread for the AMD question. Before that, could you have a look at the thread below? The two could be related.
www.mrcc.hu/index.php/forum/running-mrcc...d-2020-binaries#1016
Best wishes,
Peter