If you have problems during the execution of MRCC, please attach the output with an adequate description of your case, as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- the compiler version (for example: ifort -V, gfortran -v)
- the BLAS/LAPACK versions
- the gcc and glibc versions
This information really helps us during troubleshooting.
Best practices for parallel performance/scaling
- MXvo5e35 (Topic Author)
3 years 2 months ago #1132
Replied by MXvo5e35 on topic Best practices for parallel performance/scaling
OK, thanks for the clarification. I do appreciate the pointers!
Your idea re: the use of DF for extrapolation is indeed interesting. In fact, the basic setting of my problem revolves around calibration of a composite scheme, so I'm already considering these approaches. As I mentioned, DF adds some more caveats to the extrapolation process, and I have to admit that I'm not entirely up to speed on the theory. Do you possibly have a reference investigating the accuracy of e.g. CBS extrapolation schemes using DF vs. those without?
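(For reference, the kind of two-point correlation-energy extrapolation I have in mind is the standard inverse-cubic form for consecutive cardinal numbers X-1 and X; this is just my assumption about what "CBS extrapolation" means here, nothing MRCC-specific:

E_{\mathrm{corr}}^{\mathrm{CBS}} \approx \frac{X^3 E_{\mathrm{corr}}(X) - (X-1)^3 E_{\mathrm{corr}}(X-1)}{X^3 - (X-1)^3}

so my question is essentially whether the DF error is small and smooth enough with respect to X for this formula to remain reliable.)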
(Also, a very beginner question... Does the use of DF adjust the overall scaling of the various schemes? For example, CCSD(T) goes as O(nocc^3 nvirt^4) -- does DF change this somehow? I would guess not...?)
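To get a feel for what that scaling means in practice, here is a quick back-of-the-envelope sketch; the basis-set sizes are my own rough counts for water and are only illustrative:

# Relative cost of the perturbative (T) step, taken to scale as
# nocc^3 * nvirt^4; prefactors and permutational symmetry are ignored.
# Rough counts for water: cc-pVTZ ~ 58 basis functions, cc-pV5Z ~ 201,
# with 5 occupied orbitals.
def t_step_cost(nocc, nbasis):
    nvirt = nbasis - nocc
    return nocc**3 * nvirt**4

ratio = t_step_cost(5, 201) / t_step_cost(5, 58)
print(f"(T) cost, water cc-pV5Z vs cc-pVTZ: ~{ratio:.0f}x")  # roughly a 190-fold increase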
Re: CCSD and CCSD(T) scaling. I've been running some more test jobs. The optimised (conventional, disk-based) CCSD code is certainly competitive for performance in a shared-memory setting, but disk I/O is indeed the bottleneck, even using a fast local SSD for storage. As you suggest, the on-disk load of the various integrals becomes prohibitively high at around 650 basis functions, and at that point performance suffers relative to less disk-intensive CCSD(T) implementations in other codes such as NWChem. (Not a complaint, just an observation.)
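For what it's worth, here is the crude estimate behind why roughly that size is where disk becomes painful, assuming double precision and ignoring symmetry packing and whatever file layout MRCC actually uses:

# Order-of-magnitude size of conventional four-index MO integrals at
# 650 basis functions, double precision; symmetry packing and the
# actual on-disk format are ignored.
nbf = 650
bytes_per_double = 8
full = nbf**4 * bytes_per_double
packed = full / 8  # idealized 8-fold permutational symmetry
print(f"unpacked: ~{full / 1e12:.1f} TB, 8-fold packed: ~{packed / 1e9:.0f} GB")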
For the higher-order calculations, I've been able to run calculations up to CCSDT(Q) with 200+ basis functions (water with cc-pV5Z, so relatively few occupied orbitals) without a problem. There seems to be more room to scale up here too.
Again, thanks for the info!
- nagypeter (MRCC developer)
3 years 2 months ago #1133
Replied by nagypeter on topic Best practices for parallel performance/scaling
There are a number of studies that also assess the accuracy of DF-CCSD(T); you can start e.g. with these:
pubs.acs.org/doi/10.1021/ct400250u
aip.scitation.org/doi/10.1063/1.4820484
aip.scitation.org/doi/10.1063/1.4905005
DF-CCSD(T) still scales the same as conventional CCSD(T).
The prefactor of some DF-CCSD steps is a bit lower, but the main benefit comes from the much smaller storage requirement of the integrals. In our implementation the I/O is essentially eliminated (meaning you will hit an operation-count or memory bottleneck much sooner).
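To put a rough number on the storage difference (double precision, no symmetry packing, and an auxiliary basis about three times the size of the AO basis, which is only a ballpark assumption):

# Rough integral storage at 650 basis functions: conventional four-index
# (pq|rs) versus the three-index DF quantities (pq|P).
nbf, bytes_per_double = 650, 8
naux = 3 * nbf  # ballpark auxiliary basis size
four_index = nbf**4 * bytes_per_double          # ~1.4 TB
three_index = nbf**2 * naux * bytes_per_double  # ~6.6 GB
print(f"four-index: ~{four_index/1e12:.1f} TB, three-index DF: ~{three_index/1e9:.1f} GB")

So the three-index quantities typically fit in memory or on a modest local disk.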
Consequently, the parallel scaling is also very good and is not limited by I/O or network speed, so 1000-1500 orbitals become reachable with your hardware. Many more details about our code are given here:
pubs.acs.org/doi/abs/10.1021/acs.jctc.9b00957
pubs.acs.org/doi/abs/10.1021/acs.jctc.0c01077