If you have problems during the execution of MRCC, please attach the output with an adequate description of your case, as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- the compiler version (for example: ifort -V, gfortran -v)
- the BLAS/LAPACK versions
- the gcc and glibc versions
This information really helps us during troubleshooting; a sketch for collecting it is shown below.
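For convenience, here is a minimal shell sketch that gathers the version information listed above (it assumes an Intel or GNU toolchain; adjust the commands to the compilers and BLAS/LAPACK libraries actually used for your build):
Code:
# compiler versions
ifort -V             # Intel Fortran, if used
gfortran -v          # GNU Fortran, if used
# gcc and glibc versions
gcc --version
ldd --version        # the first line reports the glibc version
# BLAS/LAPACK: with Intel MKL the installation path usually encodes the version
echo ${MKLROOT}
# also attach the exact dmrcc/mrcc and build.mrcc command lines and the
# build.mrcc output file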
MRCC MPI crashes on AMD Epyc (oom-kill)
3 years 11 months ago #1039
by diefenbach
MRCC MPI crashes on AMD Epyc (oom-kill) was created by diefenbach
Dear all,
When running the current MRCC binary (2020-02-22) with MPI parallelism on an AMD Epyc (Zen2) architecture, the calculation hangs at the dmrcc_mpi process, which consumes all of the resident memory and then crashes with an oom-kill event:
ps xl
Code:
PID PPID WCHAN STAT TIME COMMAND
8118 8113 do_wai S 0:00 /bin/bash /var/spool/slurm/d/job486747/slurm_script
8137 8118 do_wai S 0:00 /bin/sh /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/mpirun -np 1 dmrcc_mpi
8142 8137 poll_s S 0:00 mpiexec.hydra -np 1 dmrcc_mpi
8143 8142 poll_s Ss 0:00 /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1
8146 8143 6614392 - R 1:12 dmrcc_mpi
top
Code:
KiB Mem : 52823801+total, 42264691+free, 10068935+used, 4901744 buff/cache
KiB Swap: 13421772+total, 13420800+free, 9728 used. 42610931+avail Mem
PID PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8741 20 0 80.3g 80.2g 2168 R 99.7 15.9 0:38.76 dmrcc_mpi
Memory consumption continues until the total Mem is filled and an oom-kill is issued:
Code:
slurmstepd: error: Detected 1 oom-kill event(s) in step 486747.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
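For reference, the growth of the resident memory can be followed with a simple loop like the one below (just a sketch; it logs the memory use of the dmrcc_mpi process once per second until the process disappears):
Code:
# log PID, resident and virtual memory of dmrcc_mpi once per second
while pgrep -x dmrcc_mpi > /dev/null; do
    ps -o pid,rss,vsz,comm -C dmrcc_mpi
    sleep 1
done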
Does anyone have access to AMD Epyc machines, and has anyone perhaps encountered this issue and possibly come up with a solution?
Cheers,
Martin
Below is the job script, which runs fine on Intel systems, but crashes when run on AMD Epyc (regardless of submission type, e.g., batch queueing or interactive usage):
Code:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
# 4 OpenMP/MKL threads per MPI task
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
# MRCC binaries and Intel MKL/MPI environment
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
export PATH=${MRCC_DIR}:${PATH}
# copy the input into the working directory and run MRCC
cp ${HOME}/mrcc.inp MINP
dmrcc &> mrcc.out
exit
mrcc.inp:
Code:
basis=cc-pVTZ
calc=CCSDT(Q)
scftype=ROHF
mult=2
mem=8GB
cctol=12
mpitasks=2
unit=bohr
geom
H
N 1 R1
H 2 R1 1 A
R1=2.00000000000
A=104.2458898548
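For comparison, a minimal OMP-only variant of the same job can be used to confirm that the crash is specific to the MPI run. This is only a sketch; it assumes that removing the mpitasks keyword from the input selects the serial/OMP code path:
Code:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
export PATH=${MRCC_DIR}:${PATH}
# drop the mpitasks line so that no MPI processes are started
sed '/^mpitasks/d' ${HOME}/mrcc.inp > MINP
dmrcc &> mrcc_omp.out
exit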
3 years 11 months ago #1040
by nagypeter
Replied by nagypeter on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Martin,
Unfortunately, we currently do not have access to AMD processors.
I assume that, as before, jobs without MPI work on the AMD CPUs and only the MPI jobs fail.
Do you see any other process besides dmrcc_mpi? Perhaps scf_mpi?
If you do not see the scf_mpi processes, the issue is possibly again
in an MPI_Comm_spawn call responsible for spawning the scf_mpi processes,
and it could be a similar internode communication issue as before. There could be some AMD-specific (or non-Intel-specific) issue that we are unaware of.
Did you try to play around with the suggestions for the previous problem?
You may also try to compile MRCC from source and link with
- either OpenMPI instead of IntelMPI (please see the manual first about the required OpenMPI version and patch), as sketched below,
- or a more recent IntelMPI, hoping for better AMD support.
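For what it is worth, a rough sketch of such a source build with OpenMPI follows. The -pMPI=OpenMPI switch and the paths are assumptions, so please check build.mrcc and the manual for the exact syntax and the required OpenMPI version/patch:
Code:
# environment: Intel compilers plus an OpenMPI installation that satisfies
# the version/patch requirements of the MRCC manual (paths are placeholders)
export PATH=/path/to/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/path/to/openmpi/lib:${LD_LIBRARY_PATH}
# build with 64-bit integers, OMP and MPI parallelism
./build.mrcc Intel -i64 -pOMP -pMPI=OpenMPI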
You are again welcome to share some cluster documentation, if there is any.
I hope some of these suggestions help. I am sorry that this is not much help so far.
Best of luck,
Peter & Laszlo
3 years 10 months ago #1046
by diefenbach
Replied by diefenbach on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Peter and Laszlo,
Thanks for your suggestions. Yes, the error occurs with MPI only, and it also occurs if the calculation is run on a single node (i.e., not only for runs across multiple nodes). Serial or OMP-parallel runs work fine.
In the meantime I have played around with a few different IntelMPI versions, including 2019.3, 2019.5, and 2021.1, the latter being part of the current oneAPI release. In conjunction with the MRCC binary of 2020-02-22, all of these lead to the same result: dmrcc_mpi consumes all of the resident memory and crashes with an oom event. No scf_mpi process appears.
However, compiling the latest 2020-02-22 source code with the ifort 2019.5.281 pre-installed on our cluster and MKL/IntelMPI 2019.3.199 solved this memory issue:
Code:
INTEL_MKL_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_MPI_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_CMP_ROOT=/cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux
export PATH=${INTEL_CMP_ROOT}/bin/intel64:${PATH}
export LD_LIBRARY_PATH=${INTEL_CMP_ROOT}/compiler/lib/intel64:${LD_LIBRARY_PATH}
source ${INTEL_MKL_ROOT}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_MPI_ROOT}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI    # dynamically linked libraries (default)
Apparently, I can reproduce the MPI error with the oom event if MRCC is compiled using the same compiler/MPI versions but with statically linked libraries via
Code:
./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI -s
I am not sure about the origin of this behaviour, but it may be connected to libimf.so?
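One way to check this (just a sketch, using the executable names that appear in this thread) is to compare which shared libraries the two builds actually resolve at run time:
Code:
# dynamically linked build: list the resolved shared libraries and look for libimf
ldd $(which dmrcc_mpi) | grep -i imf
ldd $(which scf_mpi) | grep -i imf
# a fully statically linked executable should report "not a dynamic executable"
ldd /path/to/static/build/dmrcc_mpi
# check which libimf.so the dynamic linker finds first
ldconfig -p | grep libimf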
Just for your information: the cluster this was tested on includes Intel and AMD compute nodes with EDR InfiniBand interconnects. The Intel nodes are dual-socket Intel Xeon Gold 6148 (Skylake) with 20 cores per socket (40 cores total) and 192 GB RAM per node; the AMD nodes are dual-socket AMD EPYC 7452 (Zen2) with 32 cores per socket (64 cores total) and 512 GB RAM per node.
Anyway, my issue with AMD and MPI appears to be solved by compiling MRCC from source using ifort 2019 with dynamically linked libraries.
Best wishes for the new year!
Martin