Page 1 of 1
Calculation stops without error when switching on D-field
Posted: Fri Sep 15, 2017 6:42 pm
by juhl
Dear Abinit community,
I am trying to include a D-field (displacement) in a calculation on CeO2. Calculations without field ran fine with this setup and I was following the tutorial for applying an E-field. However, in the first run that includes a (D)-field, the calculation always stops during the third SCF cycle. I tried a variety of things but the problem persists, even when setting the field to 0. Since I am fairly new to Abinit and there is no real error message, I was hoping to find some help if anything is not done correctly. Input and Output files are attached.
Re: Calculation stops without error when switching on D-fiel
Posted: Sun Sep 17, 2017 4:13 pm
by mverstra
Hi Juhl,
you are mixing lots of stuff: parallel, PAW, DFT+U calculations of the constant D. Could you try (with small ecut and nkpt) to see if things crash in sequential, without U, and for norm conserving, or with combinations of the three?
If a small calculation with all options on crashes, then we have something. It could be simply that something is not deallocated properly with this combination, and after a few iterations you overflow the memory. The "official" memory footprint is very small.
cheers
Matthieu
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 12:39 am
by juhl
Dear Matthieu,
agreed. Trying to locate the error, I ran the same system with:
A) no U
B) no U and no PAWs
C) no U no PAWs but nsppol 2
All these give the same result (crash during iteration #3 in DATASET 21).
Since you suggested to further locate the error, I assume you did not spot any error/inconsistency in the input. Other than a bug, are there other physical things that could go wrong for a D-field calculation?
I am running
D) no U and no PAWs sequential
right now and will update asap.
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 8:59 am
by mverstra
The D code has not been extensively used so there may well be authentic bugs in compatibility with certain features like paw or non collinear spin. In principle if your input is wrong the code should complain rather than proceed and die.
How fast a calculation can you make by reducing ecut and nkpt but still crash?
Keep us posted.
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 4:09 pm
by juhl
UPDATE:
The sequential run made it into DATASET 31, i.e., this suggests that the problem is the mpiparallelisation, which is quite unfortunate for my intended purpose. To confirm this, could you provide an example where D-field ran successfully in parallel before? Since the crash occurs during DATASET 21, i.e., the first run with active D-field, it suggests that the problem is within the routine for the D-field.
to answer your question, I can run this example right now in about 2 minutes until it crashes.
UPDATE2:
As additional test I took the test tffield_6 from the abinit/tests/tutorespfn directory. I first ran the test as is, i.e., with efield. Both serial as well as parallel versions of the code are running fine. Switching from efield to dfield, I get the same error as in my example: the calculation stops during iteration step #3 in the first dataset the dfield is applied.
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 10:50 pm
by mverstra
Hi again,
I have tried your input with ecut 6 and ngkpt 1 1 1 and 2 2 2, and it does crash (even in sequential), perhaps for the same reason as you: what I get is an internal consistency check which fails, and the job stops.
scfcv (electric field calculation) : WARNING -
The difference between pel (electronic Berry phase updated
at each SCF cycle)
and pel_cg (electronic Berryphase computed using the berryphase routine) is
pdif_mod = 0.675410089E-06
--- !WARNING
src_file: elpolariz.F90
src_line: 261
message: |
pel_cg(1) = 0.172603609E-09
pel_cg(2) = -0.134600664E-08
pel_cg(3) = -0.934581435E-05
pel(1) = -0.933478809E-06
pel(2) = -0.800590687E-06
pel(3) = 0.812486637E-03
---- KILL ---
If you look at the corresponding source file elpolariz.F90 , the output is followed by the command MSG_ERROR, which kills the job, instead of MSG_WARNING, which allows it to continue. If I replace ERROR with WARNING in that line and recompile, then use 2 2 2 k-points, I run through to dtset 31 in sequential, and things stall there after iteration 3. It could be that this numerical acuracy (2 ways of calculating the polarization which do not agree) is broken by PAW, or by U, or simply needs to be converged away with nkpt...
And it could be that this is a borderline case due to low nkpt or ecut, and is not related to your problems...
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 11:25 pm
by juhl
Hi,
based on the testing so far, why would it be related to +U, PAW? Isn't it much more likely that the serial version appears to run okay and that the problem is in the parallel part (to test this I am running including PAW and +U in serial right now and it is already past the point where the parallel calculation breaks)?
But what you got is interesting. When I use the e-filed instead of d-field, I get the same error that you get (elpolariz.F90)
Running with the setup you wrote (ecut 6, nkkopt 1 1 1). I get the old crash (DATASET 21, iteration #3) without the error you got, which might indicate that writing this error might be different in our versions? (as it can be seen in my .out file I am using 8.4.3) Than it could indeed all be related to the elpolariz error. I'll try increasing the kgrid.
UPDATE: hmm, looking in the code, this test should be there for berryopt 4 and 6, so it would be strange if this would be the problem without being printed, so I think this possibility can be disregarded, coming back to the fact that the problem appears to be only in the parallel part. Furthermore, I already tried doubling the kgrid to 8 8 8, and the behavior was the same (crash, no further error).
Re: Calculation stops without error when switching on D-fiel
Posted: Wed Sep 20, 2017 11:34 pm
by mverstra
Not being printed may simply be because of a caching problem on the nodes you use, if the writing to the log file is not flushed properly as the code dies you do not have the last chunk of text and the error.
First thing: turn MSG_ERROR it into a MSG_WARNING and try in seq and parallel.
Re: Calculation stops without error when switching on D-fiel
Posted: Thu Sep 21, 2017 7:43 pm
by juhl
No, after recompiling with MSG_WARNING instead of error (l.258 in elpolariz.F90), I get the same behavior, i.e., the parallel run stops in DATASET 21, Iteration #3
Re: Calculation stops without error when switching on D-fiel
Posted: Mon Sep 25, 2017 8:37 pm
by juhl
Any further ideas? It seems like the issue should be not too hard to fix once the error is more or less localized since the sequential code works fine (I guess the numerical problems you ran into could have been caused by the reduced cutoff. I am trying to reproduce this behavior, but I think the important message is still that the parallel run fails while the sequential does not, with otherwise unchanged parameters.)
EDIT:
You mentioned that the D-field code was not tested extensively yet. Could it be that it just never ran in the mpi version and we are looking for a bug?
Comparing the output from the parallel run (fail) and sequential run (i.e., the last/first print statements before/after the code breaks), a bug should be caused in 79_seqpar_mpi/vtorho.F90 in between lines 489 and 861.
Re: Calculation stops without error when switching on D-fiel
Posted: Mon Sep 25, 2017 10:30 pm
by mverstra
Hi Again
No precise suggestions short of running with one thread under gdb to catch the real error. I will not have time to look into this for some days. If you want to try:
1) set the code running in parallel in the background.
2) use ps aux | grep abinit to find the process ID (pid) of the main executing instance (usually the lowest pid of the abinits)
3) run gdb /PATH/abinit <pid>
And it should latch on to the running executable and pause it.
4) type "cont" (no quotes) to get gdb to continue
5) when it crashes it should give you a stack of subroutines that had been called. In the best case if it can find the sources of Abinit, gdb will give you a line number in the file.
6) type "up" recursively to visit calling subroutines and see where everything is being called from.
7) send us all of this!
Perhaps you already know all of this, but it will be helpful to someone someday.