severe multiple parallel scf and optimization problems

severe multiple parallel scf and optimization problems

Post by recohen » Wed May 25, 2016 10:58 am

I have been trying to narrow down why I cannot optimize a simple 10-atom structure in ABINIT 7.10.5. A typical run shows the maximum gradient decreasing steadily and then suddenly blowing up:

grep grad OUTFILE
max grad (force/stress) = 6.0828E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.8153E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 2.5553E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.3870E-02 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 9.2946E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 5.2657E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 4.4388E-03 > tolmxf= 1.0000E-03 ha/bohr (free atoms)
max grad (force/stress) = 1.9831E+00 > tolmxf= 1.0000E-03 ha/bohr (free atoms)

But something is very sick. I am using optcell 1, which should optimize the volume only, and after a number of steps I get:
Current rprimd gives negative (R1xR2).R3 .
Rprimd =  1.838443E+02 -1.838443E+02 -2.672648E+02
         -1.838443E+02  1.838443E+02 -2.672648E+02
         -1.838443E+02 -1.838443E+02  2.672648E+02
Action: if the cell size and shape are fixed (optcell==0),
exchange two of the input rprim vectors;
if you are optimizing the cell size and shape (optcell/=0),
maybe the move was too large, and you might try to decrease strprecon.
src_file: metric.F90

But with optcell 1, rprimd should not be changing sign!
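
As a quick check, here is a minimal Python/NumPy sketch (an illustration, not an ABINIT tool) that evaluates the triple product of the rprimd printed in the error message; since det(M) = det(M^T), it does not matter whether the printout lists the vectors as rows or columns:

import numpy as np

# rprimd as printed in the metric.F90 error message (bohr)
rprimd = np.array([
    [ 1.838443e+02, -1.838443e+02, -2.672648e+02],
    [-1.838443e+02,  1.838443e+02, -2.672648e+02],
    [-1.838443e+02, -1.838443e+02,  2.672648e+02],
])

# (R1 x R2) . R3 is just the determinant of the 3x3 matrix
print(np.linalg.det(rprimd))  # about -3.6e7 bohr^3: negative, hence the abort

Note also that these vectors are an order of magnitude larger than the ~10-15 bohr cell in the input below, so the cell has exploded, not merely changed handedness.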

I have tried many things; the latest input looks like:

ndtset 3
jdtset 1 2 3
paral_kgb 0 npband 1 npfft 1 npkpt 28
ecutsm 0.5
dilatmx 1.10
natom 10
ntypat 3
znucl 82 22 8
typat 1*1 1*2 2*3
#scalecart 9.73 9.73 14.036
scalecart 10.093073533998 10.093073533998 14.6728671828548
angdeg 3*90
spgroup 108
brvltt -1
natrd 4
xred
0.5 0 0.2515
0. 0. 0.5013
0.23 0.73 -0.04
0. 0. 0.75157
ngkpt 6 6 6
shiftk 0.5 0.5 0.5
occopt 7
nband 80
#tsmear 0.005
pawecutdg 56
pawmixdg 1
ecut 40
ixc 23
nstep 50
diemac 10.0
diemix 0.3
kptopt 1
toldff 5d-7
iscf 14
ionmov1 2
dtion 100
ionmov2 7
ionmov3 2
optcell1 1
optcell2 0
optcell3 2
ntime1 20
ntime2 20
ntime3 30
tolmxf1 1d-3
tolmxf2 5d-6
tolmxf3 1d-6
strtarget 3*-0.000679785784544 3*0.0
#restartxf=-1
strfact 100
strprecon 0.1
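
For reference, the strtarget value above corresponds to a target pressure of 20 GPa. A minimal Python sketch of the conversion (assuming the usual ABINIT convention that strtarget is a stress in Ha/bohr^3, so a positive external pressure enters with a minus sign):

# Decode strtarget: stress in Ha/bohr^3, stress = -pressure
HA_BOHR3_IN_GPA = 29421.02648438959  # 1 Ha/bohr^3 expressed in GPa
strtarget = -0.000679785784544
print(-strtarget * HA_BOHR3_IN_GPA)  # ~20.0, i.e. a 20 GPa target pressure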

Occasionally, by luck, one of these will converge: after many attempts, just by resubmitting the failed job, it will sometimes not go crazy and will continue OK. It seems like this must be a bug, either in ABINIT or in the Intel compiler (15.0, I believe). It is compiled with:

fftw/serial/3.3 wannier90/1.2 gsl/1.16 libxc/2.0 netcdf/4.3 szip/2.1

We ran the test suite, which mostly seems OK, but there are not many tests with PAW and optimization, and none with more than a handful of processors that I could find. I have been trying to run with:
paral_kgb 0 npband 1 npfft 1 npkpt 28
on 28 CPUs,
and
paral_kgb 1 npband 5 npfft 1 npkpt 28
on 140. The latter dies after some SCF steps with:
m_abi_linalg_mp_x 106 abi_xorthonormalize.f90
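
For what it's worth, the two CPU counts are consistent with the parallelization variables: with paral_kgb 1 the number of MPI processes should equal npkpt * npband * npfft (times npspinor, 1 here). A trivial Python check:

# MPI process grid implied by the two runs above
npkpt, npfft = 28, 1
print(npkpt * 1 * npfft)  # npband 1 -> 28 processes
print(npkpt * 5 * npfft)  # npband 5 -> 140 processes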

Any help would be greatly appreciated!

Ron Cohen


Re: severe multiple parallel scf and optimization problems

Post by recohen » Wed May 25, 2016 6:19 pm

OK, the problem, I guess, is compiler bugs. Adding:

-fp-model precise -fp-model source -O2 -traceback

to the compile flags (these options restrict the Intel compiler to value-safe floating-point optimizations) leads to correct behavior, although I still had one job crash in parallel diagonalization. Running that job on more processors fixed that. The full configure line that works on SuperMUC:

./configure --prefix=/lrz/sys/applications/abinit/7.10.5/intel \
--with-mpi-libs="-ldl -lmpi -lmpigc4 -lmpigf -lmpigi -lpthread -lrt" \
--enable-64bit-flags --with-dft-flavor="atompaw+libxc+wannier90" \
--with-atompaw-bins="$ATOMPAW_BASE/bin" --with-atompaw-libs="$ATOMPAW_SHLIB" \
--with-netcdf-incs="$NETCDF_INC" --with-netcdf-libs="$NETCDF_F90_SHLIB $NETCDF_SHLIB $NETCDF_CXX_SHLIB" \
--with-etsf-io-incs="$ETSF_INC" --with-etsf-io-libs="$ETSF_LIB" \
--with-libxc-incs="-I/lrz/sys/libraries/libxc/2.0.2/include" \
--with-libxc-libs="-L/lrz/sys/libraries/libxc/2.0.2/lib -lxc" \
--with-wannier90-bins="$WANNIER90_BASE/bin" --with-wannier90-incs="$WANNIER90_INC" \
--with-wannier90-libs="$WANNIER90_LIB" --enable-mpi --enable-mpi-io --with-linalg-flavor="mkl" \
--with-linalg-incs="$MKL_INC" --with-linalg-libs="$SCALAPACK_LIB $BLACS_LIB $BLACS_LIB_C $MKL_SHLIB" \
--with-fft-flavor="fftw3-mkl" --with-fft-incs="$FFTW_INC" --with-fft-libs="$FFTW_SHLIB" \
CC=mpicc FC=mpif90 CXX=mpiCC \
CFLAGS="-fp-model precise -fp-model source -O2 -traceback" \
FCFLAGS="-fp-model precise -fp-model source -O2 -traceback" \
FCFLAGS_DEBUG="-g -traceback" FC_LDFLAGS="-parallel" --with-trio-flavor="netcdf+etsf_io" \
--with-math-flavor="gsl" --with-math-incs="$GSL_INC" --with-math-libs="$GSL_SHLIB $GSL_BLAS_SHLIB"


Re: severe multiple parallel scf and optimization problems

Post by gmatteo » Wed May 25, 2016 6:39 pm

We have observed similar problems when running structural optimizations on Intel architectures with recent Intel compilers. The problem seems to be related to the optimizations (vectorization) performed by the compiler, which lead to erratic/non-deterministic behaviour on the different MPI processes.

There are developers working on this issue.
For the time being, one should prevent the compiler from "miscompiling" the code by using safe compilation options as you did.

In 8.0.6 there is also a new configure option (--enable-avx-safe-mode) to disable AVX vectorization in the problematic procedures.

M
