
parallel SCF crash for big systems (also with prtwf=0)

Posted: Mon Jun 18, 2012 4:28 pm
by elena.mol
Dear all,
I have found problems in SCF runs for “big” systems. My post is related to previous posts about “SCF crash when writing WFK” and similar ones, but it is a bit more general, since in some cases I get a crash even with prtwf=0.

These are the general conclusions I can draw from many tests on different molecules, on different machines (our local cluster; CINECA PLX: http://www.hpc.cineca.it/hardware/ibm-plx; and, in the past, CINECA SP6, which is no longer active), and with several choices of parallelization keywords (I always have nkpt=1). I mostly use abinit 6.10.3, but people from the CINECA supercomputing center (http://www.hpc.cineca.it/) have found similar behaviour with abinit 6.12.3, and I found the same problems some time ago with older abinit versions.
1) When the system size (determined by nband, acell, ecut) exceeds some threshold, which depends on the machine one is using, a parallel SCF run using npband only will crash when writing the WFK file, i.e. at convergence (or after “nstep” steps), after having written a correct DEN file.
2) This problem can be avoided, for a certain range of nband, acell, ecut, by setting prtwf=0 and then performing a second run with prtwf=1, iscf=-2, nstep=1, irdden=1, which reads the DEN file created by the first run and writes a WFK file (see the sketch after this list). I hope this procedure is also correct from a “physical” point of view... does anyone have any suggestions?
3) Increasing nband and/or acell and/or ecut further, one reaches a second “threshold” beyond which the “npband only” SCF run crashes even with prtwf=0. In these cases there are some values of npfft and bandpp which, together with npband, yield a correctly running calculation. However, not all the combinations of npband, npfft, bandpp that abinit indicates as having an “optimal weight” in a paral_kgb=-n test run actually work, and it is not clear to me how to identify the “correct” sets of npband, npfft, bandpp.
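
To make point 2 concrete, here is a minimal sketch of the two-step procedure (only prtwf, iscf, nstep and irdden are the keywords I actually mean; the parallelization values are placeholders, not those of my real inputs, and both runs of course keep the same system description, ecut, nband, etc.):

# Run 1: ground-state SCF with band/FFT parallelism, WFK writing disabled
paral_kgb 1
npband    5        # placeholder values, not those of my real inputs
npfft     4
bandpp    4
prtwf     0        # do not write the WFK file (the step where the crash occurs)
prtden    1        # do write the DEN file

# Run 2: non-self-consistent pass that reads the DEN file and writes the WFK
irdden    1        # read the DEN file produced by run 1
iscf     -2        # non-self-consistent run at fixed density
nstep     1
prtwf     1        # now write the WFK file
# (plus a convergence criterion such as tolwfr, as usual for iscf<0)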

These crashes happen without any error message directly related to insufficient memory (usually without any clear error message at all), and the calculations, as long as they are running, use considerably less than the allocated/available memory.

The typical last lines of the log files for the crashed runs look like this:
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
rank 3 in job 1 node168ib0_49519 caused collective abort of all ranks
exit status of rank 3: killed by signal 11


It seems that, at least in some of the cases mentioned above where the parallel SCF run crashes, the corresponding serial calculation runs without problems, although of course it takes a very long time (I have tried the serial version in only a few cases).


Does anyone know how to avoid these problems?

Here are the configure options I used for abinit 6.10.3 on CINECA PLX:
module purge
module load profile/advanced
module load intel/co-2011.2.137--binary
module load IntelMPI/4.0--binary
module load netcdf/3.6.2--intel--12.1--binary
module load mkl/10.2.2--binary


./configure --enable-mpi --enable-mpi-io --prefix=/gpfs/scratch/userexternal/emolteni/abinit-6.10.3_PLX_MKLlib FC=mpif90 CC=mpicc CXX=mpicxx FCFLAGS='-O2' CCFLAGS='-O2' --with-netcdf-incs="-I$NETCDF_INC" --with-netcdf-libs="-L$NETCDF_LIB -lnetcdf -lnetcdf_c++" --with-trio-flavor="netcdf" --with-linalg-libs="-L$MKL_HOME/lib/em64t -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm" --with-linalg-incs="-I$MKL_HOME/include/em64t/lp64 -I$MKL_HOME/include" --with-linalg-flavor="custom"


I'm attaching (all from abinit 6.10.3 on CINECA PLX):
* big_run1.in: input file for a system which works with prtwf=0 and with (at least) the following 2 sets of parallelization keywords: npband=5, npfft=4, bandpp=4, or npband=10, npfft=4, bandpp=2 (see the sketch after this list).

* big_run1_crash.log: log file of the crashed run for the same system, but with npband=5, npfft=1, bandpp=4.

* very_big_run1.in: input file for a system with acell and nband larger than in big_run1, for which I have not yet found any set of keywords that avoids the crash.

(Of course I put nstep=1 in these test runs to save time and CPU resources, since it is clear that the crash, if it happens, occurs at the last SCF step.)
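
For reference, this is roughly what the parallelization block looks like for the first working combination of big_run1 (only the keyword values listed above are from my tests; the comment on the process count is just my understanding of how paral_kgb distributes the work, with nproc = npband*npfft when nkpt=1):

paral_kgb 1
npband    5
npfft     4
bandpp    4
prtwf     0
# with nkpt=1 this corresponds to npband*npfft = 20 MPI processes

# a short test run with a negative paral_kgb, e.g.
# paral_kgb  -64
# is what I mean above by a "paral_kgb=-n test run": if I understand correctly,
# abinit then prints candidate (npband, npfft, bandpp) distributions with their
# weights for up to 64 processes, instead of doing the full calculation.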

Thanks a lot in advance
cheers
Elena

Elena Molteni
Department of Physics
University of Milan
via Celoria, 16
I-20133, Milan, Italy
and European Theoretical Spectroscopy Facility (ETSF)
http://www.etsf.eu