restartxf,irdwfk in parallel run: error on npw,npw1

Total energy, geometry optimization, DFT+U, spin....

elena.mol
Posts: 17
Joined: Wed Apr 21, 2010 1:05 pm

restartxf,irdwfk in parallel run: error on npw,npw1

Post by elena.mol » Tue Aug 31, 2010 3:34 pm

Hello,
I found a problem which is quite similar to the one already discussed in viewtopic.php?f=9&t=84&p=243&hilit=chaining#p243.
However, my situation seems (at least to me) a bit different, I have made some further tests, and it is not clear to me what the solution to the issue in the above-mentioned post actually was.

I'm trying to "chain" geometry optimization (ionmov=2) calculations, not by using any multi-dataset mode, but "manually" by perfoming several separate runs (which was the option suggested in the previous post if i understood well), each of which reads in input the WFK file created by the previous one, and using (starting from the 2nd run) the keywords: restartxf and irdwfk.
If I do this in serial (with both abinit 5.8.4 and abinit 6.0.4), everything works.
In parallel runs, instead, I find several kinds of errors (in the "2nd" run, i.e. the first time I use restartxf and irdwfk).
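To be concrete, here is roughly what the input of a restarted run contains. This is only a minimal illustration: cell, cutoff and k-point variables are omitted, ntime and npband carry placeholder values, and the exact restartxf value should be checked against the documentation of the version in use.

ionmov     2     # BFGS structural relaxation, as in the 1st run
ntime      20    # placeholder: max relaxation steps allowed in this run
irdwfk     1     # read the wavefunction file <input root>_WFK
restartxf  1     # restart the (x,f) history from the previous WFK file
paral_kgb  1     # same parallelization keywords as in the 1st run
npband     2     # placeholder value

Before launching such a run, the WFK file written by the previous run (e.g. run1o_WFK) is copied or linked to the input root expected by the new run's .files file (e.g. run2i_WFK); the run1o/run2i names here are just examples.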

In particular:

1) Using abinit 5.8.4 or 6.0.4, both compiled without enable-mpi-io, and using only the keywords paral_kgb and npband for parallelization, I find an error similar to the one mentioned in
viewtopic.php?f=9&t=84&p=243&hilit=chaining#p243, i.e. (in the .log file):

-P-0000 hdr_check: Wavefunction file is OK for direct restart of calculation
-P-0000 ================================================================================
-P-0000 rwwf : BUG -
-P-0000 Reading option of rwwf. One should have npw=npw1
-P-0000 However, npw= 4404, and npw1=88083.



2) With abinit 6.0.4 compiled with enable-mpi-io (but without enable-mpi-io-test), using the keywords restartxf and irdwfk in the input (and paral_kgb and npband for parallelization), the code stops writing to the output files before the 1st SCF step but still appears to be running. These are the last lines written in the .log file:

[1,0]<stdout>: ITER STEP NUMBER 1
[1,0]<stdout>: vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
[1,0]<stdout>: starting lobpcg, with nblockbd,mpi_enreg%nproc_band 13 2
[1,0]<stdout>:WARNING in dpotrf, info= 1
[1,0]<stdout>: WARNING in zhegv, info= 3
[1,0]<stdout>:WARNING in dpotrf, info= 1
[1,0]<stdout>:WARNING in dpotrf, info= 1

The same happens with abinit 6.0.4 compiled with enable-mpi-io and enable-mpi-io-test, but using a more complete set of parallelization keywords, i.e.:
PARAL_KGB= 1
NPKPT= 1
NPBAND= N
NPFFT= 1
WFOPTALG= 14
NLOALG= 4
FFTALG= 401
FFT_OPT_LOB= 2
ACCESSWFF = 1



3) abinit 6.0.4 compiled with enable-mpi-io and enable-mpi-io-test, but with the "short" set of parallelization keywords (only paral_kgb and npband), stops with a segmentation fault. In this case the last lines of the .log file are:

[1,0]<stdout>: pspatm: atomic psp has been read and splines computed
[1,0]<stdout>:
[1,0]<stdout>: -4.82082600E+01 ecore*ucvol(ha*bohr**3)
[1,0]<stderr>:[node108:25684] *** Process received signal ***
[1,0]<stderr>:[node108:25684] Signal: Segmentation fault (11)
[1,0]<stderr>:[node108:25684] Signal code: (128)
[1,0]<stderr>:[node108:25684] Failing at address: (nil)
[1,0]<stderr>:[node108:25684] [ 0] /lib64/libpthread.so.0 [0x2af69f66db10]
[1,0]<stderr>:[node108:25684] [ 1] abinit(hdr_comm_+0x31f9) [0xb11129]
[1,0]<stderr>:[node108:25684] [ 2] abinit(hdr_io_wfftype_+0x10f) [0xb188ff]
[1,0]<stderr>:[node108:25684] [ 3] abinit(inwffil_+0x440c) [0x52e04c]
[1,0]<stderr>:[node108:25684] [ 4] abinit(gstate_+0x1ba2) [0x48f1c2]
[1,0]<stderr>:[node108:25684] [ 5] abinit(gstateimg_+0x2446) [0x4343c6]
[1,0]<stderr>:[node108:25684] [ 6] abinit(driver_+0x7283) [0x42ca73]
[1,0]<stderr>:[node108:25684] [ 7] abinit(MAIN__+0x328d) [0x42395d]
[1,0]<stderr>:[node108:25684] [ 8] abinit(main+0xe) [0x100ef5e]
[1,0]<stderr>:[node108:25684] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2af69f898994]
[1,0]<stderr>:[node108:25684] [10] abinit [0x420619]
[1,0]<stderr>:[node108:25684] *** End of error message ***

I hope these further tests can help without confusing the subject. Does anyone know how to solve this problem, i.e. which set of keywords/version/etc. is best to use?

Many thanks in advance,
Elena Molteni
Department of Physics
University of Milan
via Celoria, 16
20133, Milan, Italy

mverstra
Posts: 655
Joined: Wed Aug 19, 2009 12:01 pm

Re: restartxf,irdwfk in parallel run: error on npw,npw1

Post by mverstra » Sat Sep 04, 2010 9:34 am

Hello Elena,

Could you please try 6.2.2? Things have been improved and debugged in mpi-io, and it is possible your bug has been corrected. If not, we will have to wait for answers from the mpi-io experts.

Matthieu
Matthieu Verstraete
University of Liege, Belgium
