SCF run crashed when writing WFK: how to use the DEN file?
Posted: Thu Apr 05, 2012 5:26 pm
by elena.mol
Dear all,
Is there a way to obtain a WFK file from a DEN file generated by an SCF abinit calculation that crashed at convergence, after writing the DEN but before it could write a complete WFK? Without having to restart the calculation from scratch, of course.
This question arises from some problems I'm encountering in SCF runs for “big” systems. In particular, I'm considering a molecule (175 atoms), using 1 k-point, 540 bands, and a cubic cell of 64000 Å^3. Everything works fine up to ecut=11 Ha, but with ecut=16 Ha the SCF run crashes at the end, precisely when trying to write the WFK file (after having written a correct DEN file).
I suspect these problems might be due to a disk space issue: for smaller systems with the same cutoff and the same “pathology”, in some cases I succeeded in avoiding the crash by running the job on a disk with more free space. I noticed there are other topics in this forum related to problems/crashes in writing WFK files, but it's not clear to me whether the matter has been solved yet.
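As a rough check on the disk-space hypothesis, the size of the WFK file can be estimated from the number of plane waves inside the cutoff sphere, npw ≈ Ω (2·ecut)^(3/2) / (6π²) in atomic units, times 16 bytes per complex coefficient per band. The following is a minimal Python sketch only: it ignores the file header and any reduction of the stored coefficients from symmetry, and the function name is hypothetical.

```python
import math

BOHR_PER_ANGSTROM = 1.0 / 0.529177  # 1 bohr = 0.529177 angstrom

def estimate_wfk_bytes(volume_A3, ecut_Ha, nband, nkpt=1, nspinor=1, nsppol=1):
    """Rough size of the plane-wave coefficients stored in a WFK file."""
    volume_bohr3 = volume_A3 * BOHR_PER_ANGSTROM**3
    k_max = math.sqrt(2.0 * ecut_Ha)  # radius of the cutoff sphere (atomic units)
    npw = volume_bohr3 * k_max**3 / (6.0 * math.pi**2)  # plane waves in the sphere
    bytes_per_coeff = 16  # one double-precision complex number
    return nkpt * nsppol * nband * nspinor * npw * bytes_per_coeff

# Values quoted in this post: 64000 A^3 cell, 540 bands, 1 k-point.
for ecut in (11, 16):
    size = estimate_wfk_bytes(64000, ecut, 540)
    print(f"ecut = {ecut} Ha: about {size / 1e9:.1f} GB")
```

With these numbers the estimate is on the order of 10 GB at ecut=16 Ha, so the free space on the output disk is worth checking before the run. If the single k-point is Gamma and time-reversal symmetry is exploited (istwfk 2), roughly half as many coefficients are stored.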
I'm using abinit-6.10.3 and, if needed, I can provide input files etc., but for the moment I would just like to clarify (maybe it's very trivial) whether I can “find a shortcut” by obtaining the WFK file from the DEN one. Although maybe the problem would appear again, if it's due to lack of disk space *only* at the moment of writing the WFK?
Many thanks in advance
cheers
Elena
Elena Molteni
Department of Physics
University of Milan
via Celoria, 16
I-20133, Milan, Italy
and European Theoretical Spectroscopy Facility (ETSF)
http://www.etsf.eu
Re: SCF run crashed when writing WFK: how to use the DEN file?
Posted: Wed Apr 11, 2012 3:28 am
by kaneod
Hi Elena,
If you only need the WFK file from the final density (no other properties) you can just restart the calculation with iscf -3 and use tolwfr (instead of tolvrs/toldfe) with a suitably small number to make sure the KS orbitals are converged. Note that you may need to comment out other parts of your input file (anything related to structural optimization, for example). You'll also need to make sure the final density from the crashed calculation is available as an input file in the form in_DEN where "in" is the prefix given in the third line of your ab.files file.
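A minimal sketch of the restart described above, written as a small Python helper. The file names and the prefix `in` in the usage lines are placeholders, `irdden 1` is added on the assumption that the density should be read from `in_DEN`, and the tolerance is only an example value. The helper writes just the variables that change; the rest of the original input (cell, atoms, ecut, k-points, bands) still has to be carried over.

```python
import shutil
from pathlib import Path

def prepare_nscf_restart(old_den, input_prefix, fragment_file, tolwfr=1e-16):
    """Copy the converged density to <input_prefix>_DEN and write the
    input variables that turn the run into a non-self-consistent one."""
    den_target = Path(f"{input_prefix}_DEN")
    shutil.copyfile(old_den, den_target)  # density from the crashed run

    lines = [
        "# non-self-consistent restart from an existing density",
        "iscf   -3       # density kept fixed, only the orbitals are solved for",
        "irdden  1       # read the density from <prefix>_DEN (assumption, see text)",
        f"tolwfr  {tolwfr:.1e}  # replaces tolvrs/toldfe",
    ]
    Path(fragment_file).write_text("\n".join(lines) + "\n")
    return den_target

# Hypothetical usage; adjust the names to your own run:
# prepare_nscf_restart("old_run_o_DEN", "in", "restart_fragment.in")
```

The fragment is meant to be merged into a copy of the original input after removing tolvrs/toldfe and any structural-optimization variables.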
The WFK file crashing issue(s) are not resolved as far as I'm aware, but the reports are very scattered, with different versions of abinit, types of crash/hang and usage, so I imagine we are going to have to live with them for the time being until someone can isolate the issue. I'm working on it here but we haven't got any answers yet. If the problem is isolated to the WFK write, not the disk space, you might get close to what you want using pawprtwf. This needs to be done completely in serial (no kpt or kgb parallelization), and bear in mind you get the AE wavefunctions, not the pseudized ones. Also, a warning: the AE_WFK files are gigantic. I routinely have files more than 50 GB in size for large molecules.
Re: SCF run crashed when writing WFK: how to use the DEN file?
Posted: Sun May 06, 2012 3:56 pm
by sponce
Dear Elena,
I do actually think that I have a similar problem. For a system of 23 atoms with 4x4x4 MP k-pt grid and with ecut 20Ha I got the following bug report in my log file for a simple ground state calculation:
-P-0000 leave_test : synchronization done...
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file ecuto_DS1_WFK
-P-0000 leave_test : synchronization done...
m_wffile.F90:279:COMMENT
MPI/IO accessing FORTRAN file header: detected record mark length=4
File locking failed in ADIOI_Set_lock(fd 17,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
File locking failed in ADIOI_Set_lock(fd 18,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 83466, length 27648
File locking failed in ADIOI_Set_lock(fd 17,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 138762, length 27648
File locking failed in ADIOI_Set_lock(fd 18,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 26.
If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 166410, length 27660
ADIOI_Set_lock:: Function not implemented
ADIOI_Set_lock:offset 111114, length 27648
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
It systematically crashes just after the writing of the WFK file. I'm using abinit-6.12.3.
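The failing call in the log is a POSIX `fcntl` write lock (`F_SETLKW`, `F_WRLCK`), and "Function not implemented" suggests that the file system holding the output directory does not provide it. A minimal sketch to test this outside abinit, assuming a Unix system with Python available; the function name is hypothetical, and the check is only meaningful when run on a compute node, in the directory where the WFK file is written.

```python
import fcntl
import os
import tempfile

def supports_fcntl_locks(directory):
    """Return True if a blocking POSIX write lock can be taken on a
    scratch file in `directory`, False if the file system refuses it."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="locktest_")
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # fcntl-style write lock on the whole file
        fcntl.lockf(fd, fcntl.LOCK_UN)  # release it again
        return True
    except OSError as err:
        print(f"locking failed in {directory}: {err}")
        return False
    finally:
        os.close(fd)
        os.remove(path)

if __name__ == "__main__":
    print(supports_fcntl_locks("."))
```

If this returns False in the output directory but True on a local scratch disk, writing the WFK file to the local disk, or fixing the NFS mount as the error message suggests, would be the first thing to try.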
Cheers!
Samuel.
Re: SCF run crashed when writing WFK: how to use the DEN file?
Posted: Wed Aug 01, 2012 11:04 am
by delaveau
When writing the WFK file crashes for a big case, it might be because the buffer to be written at one time is too big.
One way to decrease the size of the buffer is to set MAXBAND to a lower value in WffReadWrite_mpio.F90.
Currently it is set to 500, meaning that a message of at most 500 times the size of one band is written at one time.
The value of 500 was chosen to reduce the number of disk accesses (which are very time consuming).
Muriel Delaveau
Re: SCF run crashed when writing WFK: how to use the DEN file?
Posted: Wed Aug 01, 2012 11:31 am
by delaveau
A possible reason for a crash when writing the WFK file for a big system is the size of the message to be written at one time.
This size is fixed by MAXBAND=500 in WffReadWrite_mpio.F90. It means that a message of 500 times the size of one band (the size of cg for one band) is held in memory.
This is done to minimize the number of disk accesses, which can be very expensive in time.
So a way to avoid this cause of crashes is to decrease MAXBAND.
I hope this helps
Muriel Delaveau
Re: SCF run crashed when writing WFK: how to use the DEN file?
Posted: Wed Aug 01, 2012 3:06 pm
by delaveau
A possible cause of crashes for big cases is the size of the message to be written or read at one time.
The size of the message is set in WffReadWrite_mpio.F90 by MAXBAND=500.
How the writing is actually done is hidden inside the MPI-IO implementation,
but the size of the message can be reduced by decreasing MAXBAND.
MAXBAND=500 was chosen for performance reasons, because it minimizes the number of disk accesses.
So a way to avoid the crash is to reduce the size of the message by setting, for instance, MAXBAND=30.
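To illustrate the trade-off, here is a small Python sketch of how the message size and the number of write calls change with MAXBAND. It is only a model of the description above (bands written in chunks of at most MAXBAND, 16 bytes per complex coefficient), not code taken from WffReadWrite_mpio.F90; the function name and the plane-wave count are assumptions.

```python
import math

def mpiio_message_plan(nband, npw, maxband=500, nspinor=1):
    """Largest message (bytes) and number of write calls when the bands of
    one k-point are written in chunks of at most `maxband` bands."""
    bytes_per_band = npw * nspinor * 16  # one complex double per coefficient
    largest_message = min(nband, maxband) * bytes_per_band
    n_writes = math.ceil(nband / maxband)
    return largest_message, n_writes

# Illustrative numbers only: 540 bands, 1.3 million plane waves per band.
for maxband in (500, 30):
    size, n_writes = mpiio_message_plan(540, 1_300_000, maxband)
    print(f"MAXBAND={maxband}: largest message {size / 1e9:.2f} GB, {n_writes} writes")
```

In a parallel run the plane waves or bands are distributed over processes, so the message per process is smaller than this serial estimate, but the scaling with MAXBAND is the same.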
I hope it will help
Muriel Delaveau