severe multiple parallel scf and optimization problems
Posted: Wed May 25, 2016 10:58 am
I have been trying to narrow down why I cannot optimize a simple 10-atom structure in ABINIT 7.10.5. A typical run shows:
grep grad OUTFILE
max grad (force/stress) = 6.0828E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.8153E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 2.5553E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.3870E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 9.2946E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 5.2657E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.4388E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.9831E+00 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
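To pin down the step where the optimization goes off the rails, the grep output can be scanned automatically. This is only a minimal sketch: the helper names (`max_grads`, `first_blowup`) and the factor-of-10 threshold are my own illustrative choices, not anything ABINIT provides.

```python
import re

# Output of `grep grad OUTFILE`, copied from the run above.
GREP_OUTPUT = """\
max grad (force/stress) = 6.0828E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.8153E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 2.5553E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.3870E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 9.2946E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 5.2657E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.4388E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.9831E+00 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
"""

# Matches the number that follows "max grad (force/stress) =".
GRAD_RE = re.compile(r"max grad \(force/stress\) =\s*([0-9.]+E[+-]\d+)")


def max_grads(text):
    """Return the max-gradient values in the order they appear."""
    return [float(m.group(1)) for m in GRAD_RE.finditer(text)]


def first_blowup(grads, factor=10.0):
    """Index of the first step whose gradient exceeds the previous
    one by more than `factor` (hypothetical threshold), or None."""
    for i in range(1, len(grads)):
        if grads[i] > factor * grads[i - 1]:
            return i
    return None


grads = max_grads(GREP_OUTPUT)
step = first_blowup(grads)
if step is not None:
    print(f"gradient jumps at step index {step}: "
          f"{grads[step - 1]:.4e} -> {grads[step]:.4e}")
```

On the log above this flags the last line, where the gradient goes from 4.4388E-03 to 1.9831E+00 after seven steadily decreasing steps.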
Something is badly wrong. I am using optcell=1, which should optimize the volume only, and after a number of steps I get:
Current rprimd gives negative (R1xR2).R3 .
Rprimd = 1.838443E+02 -1.838443E+02 -2.672648E+02
-1.838443E+02 1.838443E+02 -2.672648E+02
-1.838443E+02 -1.838443E+02 2.672648E+02
Action: if the cell size and shape are fixed (optcell==0),
exchange two of the input rprim vectors;
if you are optimizing the cell size and shape (optcell/=0),
maybe the move was too large, and you might try to decrease strprecon.
src_file: metric.F90
But with optcell=1, rprimd should not be changing sign!
I have tried many things; the latest input looks like this:
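The printed cell can be checked by hand. The sketch below recomputes (R1xR2).R3 from the three rprimd rows in the error message and compares each component with half of the input scalecart along the same axis. That comparison rests on an assumption: that the starting primitive vectors for the body-centered setting (brvltt -1) have components of magnitude scalecart/2. All function names are illustrative.

```python
# rprimd rows exactly as printed in the error message (bohr).
RPRIMD = [
    (1.838443e02, -1.838443e02, -2.672648e02),
    (-1.838443e02, 1.838443e02, -2.672648e02),
    (-1.838443e02, -1.838443e02, 2.672648e02),
]

# scalecart from the input file (bohr).
SCALECART = (10.093073533998, 10.093073533998, 14.6728671828548)


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def triple_product(rows):
    """(R1 x R2) . R3, the signed cell volume."""
    return dot(cross(rows[0], rows[1]), rows[2])


volume = triple_product(RPRIMD)

# Assumption: each starting component has magnitude scalecart/2
# (body-centered primitive vectors), so this ratio estimates how
# far the cell has been scaled along each Cartesian axis.
ratios = [
    abs(RPRIMD[i][j]) / (0.5 * SCALECART[j])
    for i in range(3)
    for j in range(3)
]

print("signed volume (bohr^3):", volume)
print("scaling ratios: min", min(ratios), "max", max(ratios))
```

With these numbers the triple product is indeed negative, and every component is about 36 times the assumed starting value, with the same ratio on all three axes. Under that assumption the cell looks uniformly rescaled rather than sheared, which is at least what a volume-only move would produce if the step were far too large.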
ndtset 3
jdtset 1 2 3
paral_kgb 0 npband 1 npfft 1 npkpt 28
ecutsm 0.5
dilatmx 1.10
natom 10
ntypat 3
znucl 82 22 8
typat 1*1 1*2 2*3
#scalecart 9.73 9.73 14.036
scalecart 10.093073533998 10.093073533998 14.6728671828548
angdeg 3*90
spgroup 108
brvltt -1
natrd 4
xred
0.5 0 0.2515
0. 0. 0.5013
0.23 0.73 -0.04
0. 0. 0.75157
ngkpt 6 6 6
shiftk 0.5 0.5 0.5
occopt 7
nband 80
#tsmear 0.005
pawecutdg 56
pawmixdg 1
ecut 40
ixc 23
nstep 50
diemac 10.0
diemix 0.3
kptopt 1
toldff 5d-7
iscf 14
ionmov1 2
dtion 100
ionmov2 7
ionmov3 2
optcell1 1
optcell2 0
optcell3 2
ntime1 20
ntime2 20
ntime3 30
tolmxf1 1d-3
tolmxf2 5d-6
tolmxf3 1d-6
strtarget 3*-0.000679785784544 3*0.0
#restartxf=-1
strfact 100
strprecon 0.1
Occasionally, by luck, one of these will converge after many attempts: simply resubmitting the failed job will sometimes avoid the blow-up and continue OK. It seems like this must be a bug, either in ABINIT or in the Intel compiler (15.0, I believe). It is compiled with:
fftw/serial/3.3 wannier90/1.2 gsl/1.16 libxc/2.0 netcdf/4.3 szip/2.1
We ran the test suite, which mostly seems OK, but there are not many tests with PAW and optimization, and none that I could find with more than a handful of processors. I have been trying to run with:
paral_kgb 0 npband 1 npfft 1 npkpt 28
on 28 CPUs
and
paral_kgb 1 npband 5 npfft 1 npkpt 28
on 140. The latter dies after some SCF steps with
m_abi_linalg_mp_x 106 abi_xorthonormalize.f90
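For reference, the two process layouts can be sanity-checked with a few lines of arithmetic. This sketch assumes that the number of MPI processes should equal npkpt * npband * npfft, and that nband should divide evenly by npband; both rules are my understanding of the parallel setup rather than something verified against this ABINIT version, and the function name is illustrative.

```python
NBAND = 80  # nband from the input file


def required_procs(npkpt, npband, npfft):
    """Process count implied by the parallel grid (assumed rule:
    the product of the three distribution factors)."""
    return npkpt * npband * npfft


# The two layouts tried, with the CPU counts actually used.
layouts = {
    "paral_kgb 0": {"npkpt": 28, "npband": 1, "npfft": 1, "cpus": 28},
    "paral_kgb 1": {"npkpt": 28, "npband": 5, "npfft": 1, "cpus": 140},
}

for name, cfg in layouts.items():
    need = required_procs(cfg["npkpt"], cfg["npband"], cfg["npfft"])
    bands_ok = NBAND % cfg["npband"] == 0
    print(f"{name}: needs {need} procs, running on {cfg['cpus']}, "
          f"nband divisible by npband: {bands_ok}")
```

Both layouts pass these checks (28 and 140 processes, and 80 bands split evenly over 5), so a simple mismatch in the process grid does not obviously explain the crash in the band-parallel run.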
Any help would be greatly appreciated!
Ron Cohen