Multithreaded performance of DF-CCSD(T)
- ddatta
- Topic Author
- Offline
- New Member
I am trying to compare the multithreaded performance of the DF-CCSD(T) algorithm in MRCC with the DF-CCSD(T) codes implemented in certain other program suites. To treat all programs on an equal footing, I plan to use the sequential BLAS. I compiled MRCC with the Intel compiler with -pMPI and -pOMP.
1. When running the program, how do I distribute the OpenMP threads set by OMP_NUM_THREADS=n among the outer and the nested parallel regions?
2. If I choose to assign all threads set by OMP_NUM_THREADS=n to the outer OMP region only, while using a single thread in the inner OMP region, would it work to set ccsdthreads to n (and likewise ptthreads to n for (T))? And would that lead to the best performance of the code?
Here is the relevant top part of the input file that I am using:
basis=cc-pVDZ
dfbasis_scf=none
dfbasis_cor=cc-pVDZ-RI
calc=DF-CCSD(T)
cctol=7
ccsdmkl=seq
ccsdthreads=4
ptthreads=4
mem=50GB
It would be a great help to receive suggestions from the developers.
Many thanks in advance.
- ddatta
- Topic Author
- Offline
- New Member
- Posts: 8
- Thank you received: 0
The Intel compiler version 18.3 was used.
- nagypeter
- Offline
- Premium Member
- MRCC developer
Thank you very much for your interest.
We also found recently that such comparisons are quite hard to do fairly, because instead of using identical settings to put the codes on an equal footing, one should compare the codes at their best performance.
To that end, I would not recommend using sequential BLAS with our DF-CCSD(T). When compiled with ifort, the best performance is obtained with threaded MKL.
To your questions:
1) OMP_NUM_THREADS=n is indeed the total number of OpenMP threads, which is the number of outer threads (ccsdthreads or ptthreads) times the number of nested threads (the latter is preferably the number of threads used by threaded MKL).
2) Technically you can set ccsdthreads=n and ptthreads=n, but I would strongly advise against it. The best performance should be closer to the setting ccsdthreads=ptthreads=m, where m is the number of CPU sockets/NUMA nodes, with n/m nested threads within threaded BLAS. It might be useful to increase m a bit if the number of cores per node/CPU is large and the scaling of the threaded BLAS starts to deteriorate. This could be system dependent; I would experiment a bit with n/m around 5.
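As a rough illustration of the bookkeeping in points 1) and 2), the Python sketch below lists the ways n total threads can be split into m outer threads and n/m nested threads, keeping only values of m that are a multiple of the socket count. The function names and the filtering rule are illustrative assumptions, not part of MRCC; which split is actually fastest still has to be measured.

```python
def thread_splits(total_threads):
    """Return every (outer, nested) pair with outer * nested == total_threads."""
    return [
        (outer, total_threads // outer)
        for outer in range(1, total_threads + 1)
        if total_threads % outer == 0
    ]


def candidate_splits(total_threads, sockets):
    """Keep splits whose outer count is a multiple of the socket count.

    Assumption for illustration: outer threads should map evenly onto
    sockets/NUMA nodes. Falls back to all splits if none qualifies.
    """
    splits = thread_splits(total_threads)
    even = [(outer, nested) for outer, nested in splits if outer % sockets == 0]
    return even or splits


if __name__ == "__main__":
    # Example: 16 total threads on a dual-socket node.
    for outer, nested in candidate_splits(16, sockets=2):
        print(f"ccsdthreads=ptthreads={outer}, nested threads per outer thread={nested}")
```

Among the listed candidates, those with n/m close to 5 are the suggested starting points for experimentation.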
cctol=6 should be more than enough with the DF approximation.
You did not ask about it, but using a few MPI tasks on nodes with many cores could also increase performance. For more details on OpenMP and MPI performance, you may also want to have a look at some of our recent papers, which include related DF-CCSD(T) timing data (both open access):
pubs.acs.org/doi/abs/10.1021/acs.jctc.9b00957
pubs.acs.org/doi/abs/10.1021/acs.jctc.0c01077
I would also be happy to suggest more specific settings.
For that please share some details:
- number of cores/sockets per node
- memory per node accessible for the job
- do you want only OpenMP threaded or also MPI tasks?
- what is the number of occupied, virtual and auxiliary functions in your system?
I hope this helped and you can find the best settings for your hardware and molecule. I would be happy to assist further with that.
Best wishes,
Peter
- ddatta
- Topic Author
- Offline
- New Member
- Posts: 8
- Thank you received: 0
Many thanks for your detailed response. This is very helpful indeed. The molecular system that I am using is (H2O)10 with the cc-pVDZ/cc-pVDZ-RI basis sets. My goal is to run the DF-CCSD(T) code in MRCC for this system on a single node with 1, 2, 4, 8, and 16 OpenMP threads (that is, with 1-16 total OpenMP threads). In short, I would like to reproduce the data of Figure 1 of the paper that you published last year:
pubs.acs.org/doi/abs/10.1021/acs.jctc.9b00957
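For a scaling comparison of this kind, the measured wall times are usually converted to speedup and parallel efficiency relative to the smallest thread count. The Python sketch below shows one way to do that; the function name is hypothetical and the timings in the usage example are made-up placeholders, not measurements from MRCC or GAMESS.

```python
def speedup_table(wall_times):
    """Map thread count -> (speedup, efficiency) relative to the smallest thread count.

    wall_times: dict of {thread_count: wall time in seconds}.
    """
    base_threads = min(wall_times)
    base_time = wall_times[base_threads]
    table = {}
    for threads in sorted(wall_times):
        speedup = base_time / wall_times[threads]
        # Efficiency is the speedup divided by the relative increase in threads.
        efficiency = speedup * base_threads / threads
        table[threads] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    # Placeholder timings for illustration only.
    placeholder = {1: 100.0, 2: 50.0, 4: 40.0}
    for threads, (speedup, efficiency) in speedup_table(placeholder).items():
        print(f"{threads:2d} threads: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```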
In fact, we have an RI-CCSD(T) implementation in GAMESS and I would like to compare its single-node multithreaded performance with the DF-CCSD(T) code in MRCC. The GAMESS code does not use threaded BLAS or nested OpenMP. This is why I ran MRCC with MKL_NUM_THREADS=1 and ccsdmkl=seq, and also with ccsdthreads/ptthreads=OMP_NUM_THREADS. However, as you mentioned, this does not give the best performance.
So, I wanted to make sure that we have the best performance of the DF-CCSD(T) code in MRCC for comparison.
Here are some further details about the HPC cluster that I am using:
It has 2 CPU sockets per node with 18 physical cores per socket and 1 hardware thread per core.
Available node memory is up to 126 GiB.
The number of correlated occupied MOs is 40, the number of virtual MOs is 190, and the number of auxiliary basis functions is 840 for (H2O)10 with the cc-pVDZ/cc-pVDZ-RI basis sets.
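The occupied and virtual counts quoted above can be reproduced with a short calculation. The Python sketch below assumes a closed-shell system, spherical-harmonic cc-pVDZ functions (O: 3s2p1d, H: 2s1p), frozen O 1s core orbitals, and no functions dropped for linear dependency; the function and variable names are illustrative.

```python
# Contracted cc-pVDZ functions per atom with spherical harmonics.
CC_PVDZ_FUNCTIONS = {"O": 3 * 1 + 2 * 3 + 1 * 5, "H": 2 * 1 + 1 * 3}  # O: 14, H: 5
ELECTRONS = {"O": 8, "H": 1}
CORE_ORBITALS = {"O": 1, "H": 0}  # assumption: the O 1s orbitals are frozen


def orbital_counts(atoms):
    """Return correlated occupied and virtual MO counts for a closed-shell system.

    atoms: dict of {element symbol: number of atoms}.
    """
    n_basis = sum(CC_PVDZ_FUNCTIONS[el] * count for el, count in atoms.items())
    n_electrons = sum(ELECTRONS[el] * count for el, count in atoms.items())
    n_occupied = n_electrons // 2  # doubly occupied orbitals
    n_frozen = sum(CORE_ORBITALS[el] * count for el, count in atoms.items())
    return {
        "correlated_occupied": n_occupied - n_frozen,
        "virtual": n_basis - n_occupied,
    }


if __name__ == "__main__":
    # (H2O)10: 10 oxygen and 20 hydrogen atoms.
    print(orbital_counts({"O": 10, "H": 20}))
```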
Thank you once again.
Best wishes,
Dipayan
- nagypeter
- Offline
- Premium Member
- MRCC developer
The absolute best performance in the paper is found on 16 cores with 2 MPI tasks, 2 outer OpenMP threads (for n>2), and threaded MKL for the remaining threads. Figs. 5 and 6 show that using the 2 MPI tasks accelerates both CCSD and (T) by about 5-10% compared to 2 outer OpenMP threads plus threaded MKL.
If you do not want to use MPI, for the best OpenMP-only performance you should try both 2 and 4 outer OpenMP threads.
For such dual-socket nodes we also recommend trying
OMP_PLACES=cores
OMP_PROC_BIND=spread,close
Again, the default ccsdmkl=thr is strongly recommended.
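To make the OpenMP-only suggestion concrete, the Python sketch below assembles the environment variables and input keywords named in this thread for a given total and outer thread count. The helper name is hypothetical, and whether any further environment settings are needed on a particular system is not covered here, so treat it as a starting point to verify against the MRCC manual.

```python
def openmp_settings(total_threads, outer_threads):
    """Return (environment dict, input keyword lines) for an OpenMP-only run.

    Uses only the variables and keywords mentioned in this thread.
    """
    if total_threads % outer_threads != 0:
        raise ValueError("outer_threads must divide total_threads")
    env = {
        "OMP_NUM_THREADS": str(total_threads),  # total = outer x nested threads
        "OMP_PLACES": "cores",
        "OMP_PROC_BIND": "spread,close",
    }
    keywords = [
        f"ccsdthreads={outer_threads}",
        f"ptthreads={outer_threads}",
        "ccsdmkl=thr",  # threaded MKL, the default
    ]
    return env, keywords


if __name__ == "__main__":
    # Try both 2 and 4 outer threads with 16 threads in total.
    for outer in (2, 4):
        env, keywords = openmp_settings(16, outer)
        print(" ".join(f"{name}={value}" for name, value in env.items()))
        print("\n".join(keywords))
        print()
```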
I did not find the (H2O)10 example very informative, because the cc-pVDZ basis set and the system itself are quite small for such tests (25.4 s runtime for CCSD). For such systems you do not really need excellent parallel scaling, especially if your CPU employs AVX-512...
But previous studies used that example, so it was good for comparison.
A larger system would probably be more informative; that is why we also added the data in Table 4.
Good luck with your timing measurement, I hope you can find the best settings.
Feel free to share any further questions,
Peter