If you have problems during the execution of MRCC, please attach the output with an adequate description of your case as well as the following:
- the way mrcc was invoked
- the way build.mrcc was invoked
- the output of build.mrcc
- compiler version (for example: ifort -V, gfortran -v)
- blas/lapack versions
- gcc and glibc versions
This information really helps us during troubleshooting.
Restarts during perturbative corrections?
- Nike
- Topic Author
5 years 9 months ago #678
by Nike
Restarts during perturbative corrections? was created by Nike
Dear all,
I understand that if a job crashes during the perturbative corrections, then using rest=1 will mean MRCC goes back to the iterative step for 1 iteration, then starts the perturbative correction all the way from the beginning (spin case 1).
I have just had a job crash on "spin case 19" after about 22 days (32000 minutes) of calculations to do spin cases 1-18:
Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 2 Beta: 4
Number of excitations: 9674854
CPU time [min]: 18814.564 Wall time [min]: 9849.788
======================================================================
Spin case 2 Alpha: 3 Beta: 3
Number of excitations: 53660880
CPU time [min]: 20286.653 Wall time [min]: 10593.980
======================================================================
Spin case 3 Alpha: 4 Beta: 2
Number of excitations: 29030306
CPU time [min]: 21454.478 Wall time [min]: 11180.785
======================================================================
Spin case 4 Alpha: 5 Beta: 1
Number of excitations: 1377072
CPU time [min]: 21644.188 Wall time [min]: 11276.145
======================================================================
Spin case 5 Alpha: 2 Beta: 4
Number of excitations: 18634890
CPU time [min]: 22963.355 Wall time [min]: 11946.315
======================================================================
Spin case 6 Alpha: 2 Beta: 4
Number of excitations: 41794692
CPU time [min]: 25978.119 Wall time [min]: 13466.476
======================================================================
Spin case 7 Alpha: 3 Beta: 3
Number of excitations: 160982640
CPU time [min]: 31526.117 Wall time [min]: 16290.082
======================================================================
Spin case 8 Alpha: 3 Beta: 3
Number of excitations: 167173416
CPU time [min]: 36412.186 Wall time [min]: 18741.087
======================================================================
Spin case 9 Alpha: 4 Beta: 2
Number of excitations: 120744756
CPU time [min]: 41132.122 Wall time [min]: 21124.516
======================================================================
Spin case 10 Alpha: 4 Beta: 2
Number of excitations: 58067420
CPU time [min]: 43282.776 Wall time [min]: 22209.112
======================================================================
Spin case 11 Alpha: 5 Beta: 1
Number of excitations: 7463460
CPU time [min]: 44156.062 Wall time [min]: 22646.983
======================================================================
Spin case 12 Alpha: 5 Beta: 1
Number of excitations: 1327932
CPU time [min]: 44337.005 Wall time [min]: 22737.989
======================================================================
Spin case 13 Alpha: 2 Beta: 4
Number of excitations: 8318486
CPU time [min]: 44885.268 Wall time [min]: 23013.016
======================================================================
Spin case 14 Alpha: 2 Beta: 4
Number of excitations: 80495544
CPU time [min]: 51382.765 Wall time [min]: 26288.737
======================================================================
Spin case 15 Alpha: 2 Beta: 4
Number of excitations: 62686156
CPU time [min]: 55646.355 Wall time [min]: 28444.712
======================================================================
Spin case 16 Alpha: 3 Beta: 3
Number of excitations: 149058000
CPU time [min]: 60427.369 Wall time [min]: 30840.410
======================================================================
Spin case 17 Alpha: 3 Beta: 3
Number of excitations: 501528416
CPU time [min]: 77089.610 Wall time [min]: 39213.258
======================================================================
Spin case 18 Alpha: 3 Beta: 3
Number of excitations: 160979568
CPU time [min]: 82699.806 Wall time [min]: 42054.491
======================================================================
Spin case 19 Alpha: 4 Beta: 2
Number of excitations: 174126680
Fatal error in mrcc.
Program will stop.
************************ 2019-02-04 08:03:39 *************************
Error at the termination of mrcc.
**********************************************************************
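The "22 days" figure can be checked from the timing lines in the log above. This is only a sketch: it assumes the printed CPU and wall times are cumulative since the start of the job, and that the reading printed under spin case 1 roughly marks the start of the perturbative step. The helper name `wall_times` is made up for illustration.

```python
import re

# Two timing lines copied from the log above (spin cases 1 and 18).
log = """\
 Spin case  1   Alpha:  2   Beta:  4
 CPU time [min]: 18814.564 Wall time [min]:  9849.788
 Spin case 18   Alpha:  3   Beta:  3
 CPU time [min]: 82699.806 Wall time [min]: 42054.491
"""

WALL = re.compile(r"Wall time \[min\]:\s*([0-9.]+)")


def wall_times(text):
    """Return every 'Wall time [min]' reading in the text, in order."""
    return [float(m.group(1)) for m in WALL.finditer(text)]


times = wall_times(log)
# Assumption: readings are cumulative, so the difference between the
# last and first reading is the wall time spent between them.
elapsed_min = times[-1] - times[0]
elapsed_days = elapsed_min / (60 * 24)
print(f"{elapsed_min:.1f} min = {elapsed_days:.1f} days")
```

Under those assumptions the difference is roughly 32200 minutes, which is about 22 days.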
I wonder if it's possible to make it so that we can save the results from spin cases 1-18, and then when we restart, just start with spin-case 19 immediately?
Also, I wonder if the energy contributions for each spin-case can be printed when the spin-case finishes?
The reason is that I don't want to wait another 22 days to redo spin cases 1-18, and in cases where there are only 19 spin cases in total, an estimate based on spin cases 1-18 seems good enough. (Of course, ideally we'd do all 19 spin cases, but if that means restarting from spin case 1 and waiting 22 days, we would rather use the energy estimate from the first 18 of the 19 spin cases.)
Finally, is there an equation we can refer to in order to understand the different spin cases? I have looked at:
M. Kállay and J. Gauss (2008) Approximate treatment of higher excitations in coupled-cluster theory. II. Extension to general single-determinant reference functions and improved approaches for the canonical Hartree–Fock case. J. Chem. Phys. 129, pp. 144101, in the context of the notation in:
M. Kállay and J. Gauss (2005) Approximate treatment of higher excitations in coupled-cluster theory. J. Chem. Phys. 123, pp. 214105.
and:
Y. J. Bomble, J. F. Stanton, M. Kállay and J. Gauss (2005) Coupled cluster methods including non-iterative approximate quadruple excitation corrections. J. Chem. Phys. 123, pp. 054101.
I understand that we usually have only 3 spin cases, and that there are more when we don't have enough RAM to do the 3 spin cases in full, but what are the new spin cases and how are they designed?
With best wishes!
Nike
- kallay
- Administrator
- Mihaly Kallay
5 years 9 months ago #679
by kallay
Replied by kallay on topic Restarts during perturbative corrections?
Dear Nike,
Unfortunately, it is not possible to restart the perturbative corrections.
Best regards,
Mihaly Kallay
- Nike
- Topic Author
5 years 8 months ago #681
by Nike
Replied by Nike on topic Restarts during perturbative corrections?
Dear Mihaly,
Is it at least possible to print out the energy contribution from each spin case? I have a case where 125 spin cases out of 127 completed, before the crash. The calculation took about 100 days. I started the calculation again, but if it crashes after 125 or 126 spin cases, I'd like to be able to add up the energy contributions from spin cases 1-125 as an "estimate" of what I would get with 127 spin cases.
With best wishes,
Nike
- kallay
- Administrator
- Mihaly Kallay
5 years 8 months ago #685
by kallay
Replied by kallay on topic Restarts during perturbative corrections?
Dear Nike,
Please edit file pert.f, print out variables corr1 and corr2 after line 601, and recompile the code. This will write out the cumulated perturbative corrections.
Best regards,
Mihaly Kallay
- Nike
- Topic Author
5 years 7 months ago #687
by Nike
Replied by Nike on topic Restarts during perturbative corrections?
Dear Mihaly,
Thanks for the suggestion.
I have added the two WRITE statements below:
Code:
corr1=corr1+fct*sum1
corr2=corr2+fct*sum2
else
corr2=corr2+sum2
endif
endif
write(iout,"(' corr1 contribution: ',f18.12)") corr1
write(iout,"(' corr2 contribution: ',f18.12)") corr2
enddo !while
and re-compiled successfully but did not see any new output when I ran the program.
Also, I can see why one would want to parallelize within each spin case if there are only 1 or 2 spin cases, but when the number of spin cases is about the same as (or greater than) the number of cores, should we parallelize over the spin cases rather than within them? Here is an example with 2 cores:
Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 3 Beta: 2
Number of excitations: 464494585
CPU time [min]: 10291.275 Wall time [min]: 5419.409
======================================================================
Spin case 2 Alpha: 4 Beta: 1
Number of excitations: 113869560
CPU time [min]: 12329.565 Wall time [min]: 6445.038
======================================================================
Spin case 3 Alpha: 3 Beta: 2
Number of excitations: 1117504190
CPU time [min]: 24282.021 Wall time [min]: 12538.886
======================================================================
Spin case 4 Alpha: 3 Beta: 2
Number of excitations: 755944353
CPU time [min]: 31961.789 Wall time [min]: 16453.636
======================================================================
Spin case 5 Alpha: 4 Beta: 1
Number of excitations: 368830759
CPU time [min]: 39072.617 Wall time [min]: 20057.350
======================================================================
Spin case 6 Alpha: 4 Beta: 1
Number of excitations: 91754365
CPU time [min]: 40736.924 Wall time [min]: 20900.330
======================================================================
Spin case 7 Alpha: 3 Beta: 2
Number of excitations: 876483136
CPU time [min]: 49811.214 Wall time [min]: 25510.303
======================================================================
Spin case 8 Alpha: 3 Beta: 2
Number of excitations: 1818685774
CPU time [min]: 69571.559 Wall time [min]: 35771.351
======================================================================
Spin case 9 Alpha: 3 Beta: 2
Number of excitations: 300908380
CPU time [min]: 72551.242 Wall time [min]: 37269.068
======================================================================
Spin case 10 Alpha: 4 Beta: 1
Number of excitations: 438212395
CPU time [min]: 80907.590 Wall time [min]: 41475.333
======================================================================
Spin case 11 Alpha: 4 Beta: 1
Number of excitations: 297219400
Fatal error in mrcc.
Program will stop.
************************ 2019-03-18 18:24:40 *************************
Error at the termination of mrcc.
**********************************************************************
The speed-up we get, dividing CPU time by wall time for the first spin case (10291.275/5419.409), is only about 1.9. Wouldn't it be a perfect 2x speed-up if we did each spin case on a different core (perhaps load balanced so that each core has to do roughly the same number of total excitations)?
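A minimal sketch of the load-balancing idea, using the excitation counts of the 11 spin cases listed in the 2-core log above. It rests on assumptions that are not confirmed anywhere in this thread: that whole spin cases could be run concurrently, that the cost of a spin case scales with its number of excitations, and that there is enough memory to hold several spin cases at once. This is not an MRCC feature, and the function name `assign_lpt` is made up for illustration.

```python
import heapq

# Number of excitations per spin case, copied from the 2-core log above.
excitations = {
    1: 464494585, 2: 113869560, 3: 1117504190, 4: 755944353,
    5: 368830759, 6: 91754365, 7: 876483136, 8: 1818685774,
    9: 300908380, 10: 438212395, 11: 297219400,
}


def assign_lpt(costs, n_workers):
    """Greedy longest-processing-time-first assignment.

    Repeatedly gives the most expensive remaining spin case to the
    worker with the smallest load so far.
    Returns a list of (load, worker_id, [spin-case numbers]).
    """
    heap = [(0, w, []) for w in range(n_workers)]
    heapq.heapify(heap)
    for case, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, w, cases = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w, cases + [case]))
    return sorted(heap, key=lambda t: t[1])


plan = assign_lpt(excitations, n_workers=2)
total = sum(excitations.values())
busiest = max(load for load, _, _ in plan)
# Best speed-up this assignment could give if cost were exactly
# proportional to the excitation count (an assumption).
ideal_speedup = total / busiest

for load, worker, cases in plan:
    print(f"core {worker}: spin cases {cases}, {load} excitations")
print(f"ideal speed-up under these assumptions: {ideal_speedup:.3f}")
```

Even in this idealized picture the speed-up is bounded by the busiest core, so it reaches exactly 2x only if the excitation counts happen to split evenly.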
The situation is much more pronounced when we have more cores; here I used 11 cores:
Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 3 Beta: 2
Number of excitations: 1539990258
CPU time [min]:110395.983 Wall time [min]: 14322.505
======================================================================
Spin case 2 Alpha: 3 Beta: 2
Number of excitations: 1540225714
CPU time [min]:137626.477 Wall time [min]: 16946.664
======================================================================
Spin case 3 Alpha: 4 Beta: 1
Number of excitations: 759732452
CPU time [min]:153466.773 Wall time [min]: 18457.198
======================================================================
Spin case 4 Alpha: 3 Beta: 2
Number of excitations: 3901717692
CPU time [min]:226253.656 Wall time [min]: 25559.211
======================================================================
Spin case 5 Alpha: 3 Beta: 2
Number of excitations: 2625110920
CPU time [min]:273343.943 Wall time [min]: 30067.927
======================================================================
Spin case 6 Alpha: 3 Beta: 2
Number of excitations: 3902968798
CPU time [min]:342327.476 Wall time [min]: 37179.708
======================================================================
Spin case 7 Alpha: 3 Beta: 2
Number of excitations: 2625652424
CPU time [min]:388852.180 Wall time [min]: 41649.681
======================================================================
Spin case 8 Alpha: 4 Beta: 1
Number of excitations: 2583772872
CPU time [min]:443175.096 Wall time [min]: 46691.254
======================================================================
Spin case 9 Alpha: 4 Beta: 1
Number of excitations: 643242402
CPU time [min]:456705.232 Wall time [min]: 47999.934
======================================================================
Spin case 10 Alpha: 3 Beta: 2
Number of excitations: 3247203320
Fatal error in mrcc.
Program will stop.
************************ 2019-01-28 11:48:09 *************************
Error at the termination of mrcc.
**********************************************************************
We can see that the ratio of CPU time to wall time for the first spin case gives only a 7.7x speed-up (110395.983/14322.505 = 7.7), whereas if we did 11 spin cases on 11 different cores, we would get a perfect 11x speed-up. In this case there were 22 spin cases in total, so we could easily have had each core doing a separate spin case.
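For reference, the same ratio can be turned into a parallel efficiency (speed-up divided by number of cores). This is just a sketch of the arithmetic: it assumes CPU time divided by wall time is a fair estimate of the average number of busy cores, and since the printed timers appear to be cumulative, the figures also include whatever ran before the first spin case. The function names are made up for illustration.

```python
def speedup(cpu_min, wall_min):
    """Estimated speed-up: CPU time divided by wall time."""
    return cpu_min / wall_min


def efficiency(cpu_min, wall_min, n_cores):
    """Parallel efficiency: speed-up as a fraction of the core count."""
    return speedup(cpu_min, wall_min) / n_cores


# First timing line of the 11-core log above.
s11 = speedup(110395.983, 14322.505)
e11 = efficiency(110395.983, 14322.505, n_cores=11)

# First timing line of the 2-core log.
s2 = speedup(10291.275, 5419.409)
e2 = efficiency(10291.275, 5419.409, n_cores=2)

print(f"11 cores: speed-up {s11:.2f}, efficiency {e11:.0%}")
print(f" 2 cores: speed-up {s2:.2f}, efficiency {e2:.0%}")
```

By this measure the 2-core run is at roughly 95% efficiency and the 11-core run at roughly 70%.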
With best wishes!
Nike
- kallay
- Administrator
- Mihaly Kallay
5 years 7 months ago #688
by kallay
Replied by kallay on topic Restarts during perturbative corrections?
Dear Nike,
What sort of calculation are you running? I have tested it for CCSDT(Q), and it works correctly.
Concerning the spin cases: please note that the division of the spin cases is related to the memory requirements and not to parallelization.
Best regards,
Mihaly Kallay