If you have problems during the execution of MRCC, please attach the output with an adequate description of your case as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- compiler version (for example: ifort -V, gfortran -v)
- BLAS/LAPACK versions
- as well as gcc and glibc versions
This information really helps us during troubleshooting.
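For example, on a typical GNU/Linux cluster the compiler and library versions can be collected with commands along these lines (a rough sketch; adjust to your toolchain, and note that the BLAS/LAPACK check below assumes an MKL installation):

    ifort -V 2>&1 | head -n 2        # or: ifort --version
    gfortran -v 2>&1 | tail -n 1     # ends with the "gcc version ..." line
    gcc --version | head -n 1
    ldd --version | head -n 1        # reports the glibc version on GNU systems
    echo "$MKLROOT"                  # BLAS/LAPACK: e.g. the MKL installation in use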
Best practices for parallel performance/scaling
3 years 4 months ago #1126
by MXvo5e35
Best practices for parallel performance/scaling was created by MXvo5e35
I'm interested in applying the higher-order CC methods (both with and without perturbative corrections) in MRCC to relatively large problems, and I'm wondering if there are any suggestions on how best to structure jobs for parallel efficiency and problem feasibility.
I have had a poke around the forum history, and there are a couple of posts discussing performance; for example, this post re: optimal numbers of processors, and this one regarding slow CCSDT calculations. The former was made several years ago, and the release notes suggest that the implementation may have been tweaked/tuned since then. The latter recommends using "more memory", but the process by which the recommended numbers were arrived at isn't immediately clear to me.
To make my problem a bit more concrete: I'm interested in applying standard non-DF CC(n) and CC(n)(n+1) to as high orders as possible, for systems composed of mostly first-row species and equipped with hundreds of basis functions. (Thousands would be nice, obviously, but I doubt that's going to be feasible in general.) The nature of my problem space precludes the use of symmetry, so everything is C1. I am also interested in obtaining results both with and without frozen cores.
My execution environment is a moderately recent compute cluster with a couple of hundred nodes, each with two 16-core Xeon processors and around 192GB of RAM. The shared filesystem performs poorly, so I'll be using node-local working directories.
Some explicit questions to get the discussion started:
1) From watching execution, it looks to me as if the non-root MPI processes do not explicitly store copies of integrals (AO or transformed MO) on disk; rather, only the root process of the relevant spawned communicator needs to do this. Is this correct, or have I misunderstood?
2) When using multiple MPI processes, how does the overall memory load of the calculation decompose in parallel? For example, does every process need to store all the various amplitudes? Or is the storage of the amplitudes split more-or-less evenly across the processors? (This is important, because if so, it would imply that I can pool the memory of multiple nodes to attack really large problems. If not, I will be limited by the maximum memory available on a single node.)
3) Some test calculations I've done suggest that OpenMP thread-based parallelism performs slightly (a few percent) better than MPI process-based parallelism for the same core count. Is it reasonable then to expect a "good" hybrid decomposition to be one or two MPI ranks per compute node, with OpenMP doing the rest? Or is there perhaps some benefit to using exclusively MPI parallelism?
4) As mentioned above, one of the previous posts suggests that "more memory" is a viable strategy for improving performance. Is there an obvious way to read the "optimal" memory for a calculation from the output log? (I do see some lines mentioning "optimal" values but, at least in the cases I've tried, they seem to be identical to "minimal" values...?)
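(For concreteness, the lines I mean are the ones picked out by something like the command below, assuming the job output has been redirected to a file named out:)

    grep -iE "minimal|optimal" out    # memory-requirement lines in the output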
3 years 4 months ago #1127
by nagypeter (MRCC developer)
Replied by nagypeter on topic Best practices for parallel performance/scaling
Dear MXvo5e35,
There are additional details on the parallel scaling performance in the paper documenting the MRCC features:
aip.scitation.org/doi/abs/10.1063/1.5142048
As noted in Sect. II.L, the MPI scaling of the iterative CC methods (with the exception of the hand-written MPI-OpenMP parallel CCSD and the OpenMP-only CC2 codes) is rather limited due to high I/O costs. The perturbative methods, however, should scale well with both OpenMP and MPI.
The integral files are broadcast to all compute nodes, unless they all use the same network file system. To parallelize the I/O of the MPI tasks, you should use local hard disks.
The memory is also replicated in the case of CC(n) (and any other CC method using the general-order CC code via ccprog=mrcc), so you are currently limited by the memory of a single node.
For that reason, a hybrid OpenMP-MPI execution is recommended, as the memory increase with the number of OpenMP threads is quite small. If the MPI part scales well and you have sufficient per-node memory, increasing the MPI tasks to 2 (or 4) could be beneficial.
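A minimal sketch of such a hybrid setup on one of your 2x16-core nodes could look like the following. The keyword names (mpitasks, mem, ccprog) and the values are only for illustration and should be checked against the manual for your MRCC version:

    # hybrid MPI+OpenMP launch on a single 2x16-core node (illustrative values)
    export OMP_NUM_THREADS=16       # OpenMP threads per MPI task
    export MKL_NUM_THREADS=16       # if MRCC is linked against MKL
    # MINP fragment (geometry/basis lines omitted):
    #   calc=CCSDT
    #   ccprog=mrcc
    #   mem=80GB
    #   mpitasks=2
    dmrcc > mrcc.out 2>&1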
I am not sure about your target applications, but just to let you know, thousands of orbitals are out of the question, except for CC2 and CCSD(T).
A few hundred orbitals also seems challenging, especially all-electron and in C1 symmetry, unless you have a very small number of occupied MOs.
Maybe you should experiment a bit, starting with fewer than 100 orbitals.
Nike also has extensive experience with large-scale calculations. If he could share his largest examples, that would also be useful and appreciated.
Best wishes,
Peter
3 years 4 months ago #1128
by MXvo5e35
Replied by MXvo5e35 on topic Best practices for parallel performance/scaling
Hi Peter,
Thanks for the quick response!
I was joking about thousands of basis functions.
The memory replication (rather than distribution) will definitely be a limiting factor for me, but that's OK: I'm mostly asking to get a sense of what kinds of calculations will and won't be feasible.
One technical question re: usage of node-local storage. Is there a tidy way in your experience to explicitly specify the working directory for the MPI ranks? It seems from what I see that the working directory is just inherited from that of the main dmrcc process, and I haven't been able to find anything in the manual about this.
3 years 4 months ago #1129
by nagypeter
Replied by nagypeter on topic Best practices for parallel performance/scaling
Hi,
All MPI tasks should create and run in a folder with the same PATH as that of the driver dmrcc code, which runs on only one of the nodes. Thus all working-directory PATHs should be the same on all local disks; if needed, they are created by the slave tasks, and in case of normal termination they are also cleaned up.
I am not sure what else you would need; let me know.
For now, especially in case of error termination, this could be messy and you need to clean up the temporary files yourself.
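For illustration, with SLURM and node-local /scratch a job script along these lines should give every node the same local working directory (a rough, untested sketch; adapt the paths and scheduler commands to your site):

    SCR=/scratch/$USER/mrcc.$SLURM_JOB_ID
    srun --ntasks-per-node=1 mkdir -p "$SCR"     # create the same PATH on every node
    cp "$SLURM_SUBMIT_DIR"/MINP "$SCR"/
    cd "$SCR"                                    # dmrcc starts here, so the spawned
                                                 # MPI tasks use the same PATH
    dmrcc > "$SLURM_SUBMIT_DIR"/mrcc.out 2>&1
    # after an error termination, clean up by hand:
    # srun --ntasks-per-node=1 rm -rf "$SCR"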
I was not joking about CCSD(T) with a thousand orbitals.
Best wishes,
Peter
3 years 4 months ago #1130
by MXvo5e35
Replied by MXvo5e35 on topic Best practices for parallel performance/scaling
Is that a conventional CCSD(T), though, or only one using density fitting? It seems from my reading of the manual that the fast MPI-based CCSD and CCSD(T) implementations are only supported for DF, which unfortunately isn't useful for what I'm doing. (It might be in the future, but DF adds more "degrees of freedom" to consider when benchmarking for accuracy than I want to deal with right now.)
Last edit: 3 years 4 months ago by MXvo5e35.
3 years 4 months ago #1131
by nagypeter
Replied by nagypeter on topic Best practices for parallel performance/scaling
Yes, only the DF-based CCSD(T) is OpenMP+MPI parallel and can treat 1000-1500 orbitals. The conventional CCSD(T) code is only OpenMP parallel, but ccprog=ccsd is still much faster for conventional CCSD and CCSD(T) than ccprog=mrcc (even with MPI for the latter). You will probably run out of resources with conventional CCSD(T) somewhere around 600-800 orbitals...
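For illustration, a conventional (non-DF) CCSD(T) run with the hand-written code could be set up roughly as below; this is only a sketch (geometry, basis, and memory settings omitted), so please verify the keywords against the manual:

    # conventional CCSD(T) via the OpenMP-parallel ccprog=ccsd code
    # MINP fragment:
    #   calc=CCSD(T)
    #   ccprog=ccsd
    export OMP_NUM_THREADS=32       # use all 32 cores of a node via OpenMP
    dmrcc > mrcc.out 2>&1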
I saw that you probably need a higher order than CCSD(T); I just noted these because it is common in some composite schemes to employ basis set incompleteness corrections at a lower level, e.g., CCSD(T), for which both the DF and non-DF CCSD(T) codes could be of use.
(DF is usually a very good approximation, e.g., compared to the remaining basis set incompleteness error, but of course you need to decide whether that is tolerable for your case.)