If you have problems during the execution of MRCC, please attach the output with an adequate description of your case, as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- the compiler version (for example: ifort -V, gfortran -v)
- the BLAS/LAPACK versions
- the gcc and glibc versions
This information really helps us during troubleshooting; a sketch for collecting it is shown below.
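For convenience, here is a minimal shell sketch that gathers the version information listed above (it assumes an Intel or GNU toolchain; adjust the commands to the compilers and BLAS/LAPACK libraries actually used for your build):
Code:
# compiler versions
ifort -V             # Intel Fortran, if used
gfortran -v          # GNU Fortran, if used
# gcc and glibc versions
gcc --version
ldd --version        # the first line reports the glibc version
# BLAS/LAPACK: with Intel MKL the installation path usually encodes the version
echo ${MKLROOT}
# also attach the exact dmrcc/mrcc and build.mrcc command lines and the
# build.mrcc output file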
MRCC MPI crashes on AMD Epyc (oom-kill)
3 years 11 months ago #1039
by diefenbach
MRCC MPI crashes on AMD Epyc (oom-kill) was created by diefenbach
Dear all,
When running the current MRCC binary (2020-02-22) with MPI parallelism on an AMD Epyc (Zen2) architecture, the calculation hangs at the dmrcc_mpi process, which consumes all of the resident memory and then crashes with an oom-kill event:
ps xl
Code:
PID PPID WCHAN STAT TIME COMMAND
8118 8113 do_wai S 0:00 /bin/bash /var/spool/slurm/d/job486747/slurm_script
8137 8118 do_wai S 0:00 /bin/sh /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/mpirun -np 1 dmrcc_mpi
8142 8137 poll_s S 0:00 mpiexec.hydra -np 1 dmrcc_mpi
8143 8142 poll_s Ss 0:00 /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1
8146 8143 6614392 - R 1:12 dmrcc_mpi
top
Code:
KiB Mem : 52823801+total, 42264691+free, 10068935+used, 4901744 buff/cache
KiB Swap: 13421772+total, 13420800+free, 9728 used. 42610931+avail Mem
PID PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8741 20 0 80.3g 80.2g 2168 R 99.7 15.9 0:38.76 dmrcc_mpi
Memory consumption continues until the total Mem is filled and an oom-kill is issued:
Code:
slurmstepd: error: Detected 1 oom-kill event(s) in step 486747.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
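For reference, the growth of the resident memory can be followed with a simple loop like the one below (just a sketch; it logs the memory use of the dmrcc_mpi process once per second until the process disappears):
Code:
# log PID, resident and virtual memory of dmrcc_mpi once per second
while pgrep -x dmrcc_mpi > /dev/null; do
    ps -o pid,rss,vsz,comm -C dmrcc_mpi
    sleep 1
done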
Does anyone have access to AMD Epyc machines, and has anyone perhaps encountered this issue and possibly come up with a solution?
Cheers,
Martin
Below is the job script, which runs fine on Intel systems, but crashes when run on AMD Epyc (regardless of submission type, e.g., batch queueing or interactive usage):
Code:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
# 4 OpenMP/MKL threads per MPI task
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
# MRCC binaries and Intel MKL/MPI environment
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
export PATH=${MRCC_DIR}:${PATH}
# copy the input into the working directory and run MRCC
cp ${HOME}/mrcc.inp MINP
dmrcc &> mrcc.out
exit
mrcc.inp:
Code:
basis=cc-pVTZ
calc=CCSDT(Q)
scftype=ROHF
mult=2
mem=8GB
cctol=12
mpitasks=2
unit=bohr
geom
H
N 1 R1
H 2 R1 1 A
R1=2.00000000000
A=104.2458898548
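For comparison, a minimal OMP-only variant of the same job can be used to confirm that the crash is specific to the MPI run. This is only a sketch; it assumes that removing the mpitasks keyword from the input selects the serial/OMP code path:
Code:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
export PATH=${MRCC_DIR}:${PATH}
# drop the mpitasks line so that no MPI processes are started
sed '/^mpitasks/d' ${HOME}/mrcc.inp > MINP
dmrcc &> mrcc_omp.out
exit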
3 years 11 months ago #1040
by nagypeter
Replied by nagypeter on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Martin,
Unfortunately, we currently do not have access to AMD processors.
I assume that, as before, jobs without MPI work on the AMD CPUs and only the MPI jobs fail.
Do you see any other process besides dmrcc_mpi? Perhaps scf_mpi?
If you do not see the scf_mpi processes, the issue is possibly again
in an MPI_Comm_spawn call responsible for spawning the scf_mpi processes,
and it could be a similar internode communication issue as before. There could be some AMD-specific (or non-Intel-specific) issue that we are unaware of.
Did you try to play around with the suggestions for the previous problem?
You may also try to compile MRCC from source and link with
- either OpenMPI instead of IntelMPI (please see the manual first about the required OpenMPI version and patch), as sketched below,
- or a more recent IntelMPI, hoping for better AMD support.
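For what it is worth, a rough sketch of such a source build with OpenMPI follows. The -pMPI=OpenMPI switch and the paths are assumptions, so please check build.mrcc and the manual for the exact syntax and the required OpenMPI version/patch:
Code:
# environment: Intel compilers plus an OpenMPI installation that satisfies
# the version/patch requirements of the MRCC manual (paths are placeholders)
export PATH=/path/to/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/path/to/openmpi/lib:${LD_LIBRARY_PATH}
# build with 64-bit integers, OMP and MPI parallelism
./build.mrcc Intel -i64 -pOMP -pMPI=OpenMPI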
You are again welcome to share some cluster documentation, if there is any.
I hope some of these suggestions help. I am sorry that this is not much help so far.
Best of luck,
Peter & Laszlo
3 years 10 months ago #1046
by diefenbach
Replied by diefenbach on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Peter and Laszlo,
Thanks for your suggestions. Yes, the error occurs with MPI only, and it also occurs if the calculation is run on a single node (i.e., not only for runs across multiple nodes). Serial or OMP-parallel runs work fine.
In the meantime I have played around with a few different IntelMPI versions, including 2019.3, 2019.5, and 2021.1, the latter being part of the current oneAPI release. In conjunction with the MRCC binary of 2020-02-22, all of these lead to the same result: dmrcc_mpi consumes all of the resident memory and crashes with an oom event. No scf_mpi process appears.
However, compiling the latest 2020-02-22 source code with the ifort 2019.5.281 pre-installed on our cluster and MKL/IntelMPI 2019.3.199 solved this memory issue:
Code:
INTEL_MKL_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_MPI_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_CMP_ROOT=/cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux
export PATH=${INTEL_CMP_ROOT}/bin/intel64:${PATH}
export LD_LIBRARY_PATH=${INTEL_CMP_ROOT}/compiler/lib/intel64:${LD_LIBRARY_PATH}
source ${INTEL_MKL_ROOT}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_MPI_ROOT}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI    # dynamically linked libraries (default)
Apparently, I can reproduce the MPI error with the oom event if MRCC is compiled using the same compiler/MPI versions but with statically linked libraries via
Code:
./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI -s
I am not sure about the origin of this behaviour, but it may be connected to libimf.so?
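One way to check this (just a sketch, using the executable names that appear in this thread) is to compare which shared libraries the two builds actually resolve at run time:
Code:
# dynamically linked build: list the resolved shared libraries and look for libimf
ldd $(which dmrcc_mpi) | grep -i imf
ldd $(which scf_mpi) | grep -i imf
# a fully statically linked executable should report "not a dynamic executable"
ldd /path/to/static/build/dmrcc_mpi
# check which libimf.so the dynamic linker finds first
ldconfig -p | grep libimf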
Just for your information: the cluster this was tested on includes Intel and AMD compute nodes with EDR InfiniBand interconnects. The Intel nodes are dual-socket Intel Xeon Gold 6148 (Skylake) with 20 cores per socket (40 cores total) and 192 GB RAM per node; the AMD nodes are dual-socket AMD EPYC 7452 (Zen2) with 32 cores per socket (64 cores total) and 512 GB RAM per node.
Anyway, my issue with AMD and MPI appears to be solved by compiling MRCC from source using ifort 2019 with dynamically linked libraries.
Best wishes for the new year!
Martin