If you have problems during the execution of MRCC, please attach the output with an adequate description of your case, as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- compiler version (for example: ifort -V, gfortran -v)
- blas/lapack versions
- gcc and glibc versions
This information really helps us during troubleshooting.
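A quick way to collect most of this on a typical Linux system (a rough sketch; it assumes the GNU toolchain, and the MKLROOT check only applies if Intel MKL provides your BLAS/LAPACK):
  # Gather toolchain versions for a bug report
  gfortran -v            # or: ifort -V
  gcc --version
  ldd --version          # the first line reports the glibc version
  echo "$MKLROOT"        # points to the MKL (BLAS/LAPACK) install, if MKL is used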
Acceleration in CCSDT calculation
4 years 9 months ago #813
by duanchenru
Replied by duanchenru on topic Acceleration in CCSDT calculation
Hi Mihaly,
After giving it more memory, it runs much faster! Thanks for your help.
Best,
Chenru
4 years 9 months ago #814
by Nike
Replied by Nike on topic Acceleration in CCSDT calculation
Dear Chenru,
You are using 16 cores. Are you able to use 32? If you roughly double the memory, you can use 32 cores. Or if you increase memory by roughly 1.5x, you can use 24 cores.
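For reference, the memory is set with the mem keyword in the MINP file, so a fragment for a 32-core attempt might look like this (the 250GB figure is only my rough guess above; the thread count is usually set in the environment, e.g. export OMP_NUM_THREADS=32 in the submission script, rather than in MINP):
  calc=CCSDT
  mem=250GB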
Also, I have found that using node-local storage sped up my calculations by a large amount. Are you running this on a cluster? The cluster was originally doing all the I/O for intermediate files on the /project/, /work/, or /scratch/ space, but when I actually logged into the compute node and used the compute node's local storage in the /tmp/ directory, the I/O was much, much faster.
Also, you might get the final converged result faster by reducing the number of iterations. This can be done by running a CCSD calculation first and making sure you don't lose the fort.16 file, which stores the cluster amplitudes of the CCSD calculation. When CCSDT then starts from those converged cluster amplitudes, fewer CCSDT iterations are needed. I see you have already accomplished this by using rest=2, so good choice there! (A sketch of the workflow is below.)
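For other readers, the two-step restart might look roughly like this (a sketch; dmrcc is the MRCC driver, and backing up fort.16 is just a precaution):
  # Step 1: converge CCSD and keep the amplitude file (MINP contains calc=CCSD)
  dmrcc > ccsd.out
  cp fort.16 fort.16.ccsd   # back up the CCSD amplitudes
  # Step 2: restart CCSDT from the CCSD amplitudes
  # (MINP now contains calc=CCSDT and rest=2; fort.16 must stay in the directory)
  dmrcc > ccsdt.out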
You can also reduce the number of iterations by loosening the CC convergence threshold. Your CCSD calculation converged to micro-Hartree precision in 18 iterations, but it then ran for 36 iterations before reporting convergence! The last 18 iterations barely improved the energy at all.
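If I remember the manual correctly, this threshold is controlled by the cctol keyword in MINP, which gives the exponent of the convergence threshold (please verify against your manual version). Loosening it from the default 10^-6 to 10^-5 would then be a one-line change:
  cctol=5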
With best wishes,
Nike
4 years 8 months ago #816
by duanchenru
Replied by duanchenru on topic Acceleration in CCSDT calculation
Hi Nike,
Thanks for your suggestions!
1) I generally do not find very good scaling in CCSDT calculations. For example, I got a 1.5x speedup in wall time when moving from 8 to 16 cores, but the wall time does not change from 16 to 32 cores, even though I have sufficient memory in all three cases.
2) Do you know how to change the default I/O path to /tmp? I am submitting my calculations on a cluster, though, so it is probably not easy.
3) Thanks. I learned this from the MRCC manual.
4) I could adjust it from 1e-6 (default) to 1e-5.
Best,
Chenru
4 years 8 months ago #817
by Nike
Replied by Nike on topic Acceleration in CCSDT calculation
Dear Chenru,
Sufficient memory is not the same as optimal memory, as you noticed last week. You did say that increasing the RAM sped up your calculations at 16 cores (even though you had "sufficient" memory before). If you go up to 32 cores you have to increase the RAM again. At 16 cores your optimal RAM was 145GB, so at 32 cores it might be 250GB (you'll only know once you try). You can also try 24 cores. It would be quite a coincidence if *exactly* 16 cores were to be perfectly optimal.
Every cluster is different, so you may want to ask the staff in charge of the cluster about this. On one cluster I simply ssh into the compute node, copy the MINP, GENBAS, and submission script to it, and literally run the job from the compute node in the /tmp directory. This way all fort.* files are written and read from /tmp/ on the node itself rather than from a /project/ or /work/ folder that is shared by many nodes. However, very few clusters allow users to do this type of thing. On a different cluster, I had to ask the staff how to use "node local storage"; they did give me a way to do it, but it was less simple and unique to that particular cluster.
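Where direct ssh is not allowed, the same staging can often be done inside the job script itself. A rough sketch for a SLURM cluster (the scheduler, paths, and the copy-back step are assumptions that depend on local policies):
  #!/bin/bash
  #SBATCH --nodes=1 --cpus-per-task=16
  # Stage inputs to node-local disk, run there, copy results back
  WORKDIR=/tmp/$SLURM_JOB_ID
  mkdir -p "$WORKDIR"
  cp MINP GENBAS "$WORKDIR"/
  cd "$WORKDIR"
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  dmrcc > mrcc.out
  cp mrcc.out fort.16 "$SLURM_SUBMIT_DIR"/   # keep the output and amplitudes
  rm -rf "$WORKDIR"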
Good luck with the calculations!
Nike
4 years 8 months ago #818
by nagypeter
Replied by nagypeter on topic Acceleration in CCSDT calculation
Dear Chenru,
The suggestions of Nike should help; I agree that a node-local disk (preferably an SSD) or a network file system with fast I/O could speed up your calculation. There is probably an I/O bottleneck in your system, which is why you do not see any further speedup above 16 cores.
A couple more suggestions:
1) If you manage to speed up the I/O, you might want to experiment with the MPI-OpenMP parallel version of the latest release. Although the MPI parallel layer scales much better for the (T) or (Q) parts, you might still gain here, assuming fast communication.
2) A substantial speedup could be realized by compressing the virtual space in your calculation via, e.g., MP2 natural orbitals. I would recommend the keywords eps and ovirt and Ref. [20] of the current manual for your consideration (a minimal input sketch follows below). This is an approximation, but in your case you can easily correct for it at the CCSD(T) level.
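For instance, a minimal MINP fragment for such a run might look like the following (I am assuming ovirt=ovos selects the OVOs and eps sets the truncation; the eps value is only illustrative, so please check the eps and ovirt entries of the manual for the exact semantics):
  calc=CCSDT
  ovirt=ovos
  eps=0.8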
Best wishes,
Peter
4 years 8 months ago #822
by duanchenru
Replied by duanchenru on topic Acceleration in CCSDT calculation
Hi Peter,
Thanks for the suggestions! I tried the second idea and compressed the virtual space using OVOs with an 80% cutoff. It accelerated the CCSDT calculation roughly threefold while introducing only a ~10 mHartree deviation. Since I only care about the relative correlation energy at different levels of CC, this error is acceptable.
Best,
Chenru