abinit6.12.3 crashes in more than 1 node (Intel Comp)


Post by ivasan » Tue Oct 09, 2012 12:43 pm

Hello Everyone,

I have compiled abinit 6.12.3 without errors and it runs properly when I use only one computing node of our cluster (each node has two hexa-core processors). The problem arises when I try to use more than one computing node: it crashes at the very beginning.

In the attached files there is all the information I could gather:

- File 'config.log': log file from the configuration of abinit. In brief: I used Intel MPI 4.0.3, Intel Compilers XE2013, Intel MKL BLACS and Intel MKL FFTW3.

- File 'simul.log': log file of the simulation that crashes. I used several options within mpirun to get the information related to MPI calls since it seems that the problem is there ( -v -check_mpi -genv I_MPI_DEBUG 5).

- File 'abinit.in': the input file of the simulation I am trying to run, just in case it is meaningful. In brief, I want to generate the WFK necessary for a subsequent run to generate a KSS file. This input file works fine if I only use one computing node, so I don't think that the problem is here.

It seems from the log that the errors are related to MPI since there are messages such as:

[23] ERROR: LOCAL:MPI:CALL_FAILED: error
[23] ERROR:    Null communicator.
[23] ERROR:    Error occurred at:
[23] ERROR:       mpi_comm_rank_(comm=MPI_COMM_NULL, *rank=0x29319b8, *ierr=0x7fff83fabb74)
[23] ERROR:       initmpi_grid_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/51_manage_mpi/initmpi_grid.F90:178)
[23] ERROR:       invars1_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1.F90:1015)
[23] ERROR:       invars1m_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1m.F90:186)
[23] ERROR:       m_ab6_invars_mp_ab6_invars_load_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/m_ab6_invars_f90.F90:548)
[23] ERROR:       MAIN__ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/98_main/abinit.F90:260)
[23] ERROR:       main (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)
[23] ERROR:       (/lib64/libc-2.5.so)
[23] ERROR:       (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)


I would appreciate it if anyone could give me a hint about what I can check or modify in order to solve this problem.

Thank you very much in advance for your answers and your time.

Kind regards,

Iván
Attachments:
- config.log: log file from ./configure <options>
- abinit.in: input file of the simulation that crashes
- simul.log: log file from the simulation that crashes
Dr. Iván Santos
Dpto. Electricidad y Electrónica
Universidad de Valladolid, Spain

Re: abinit6.12.3 crashes when using several computing nodes

Post by pouillon » Tue Oct 09, 2012 1:42 pm

Your problem very likely comes from a bug in the Intel 13.0 compilers. In general, I strongly advise against using *.0 versions of Intel compilers, as they usually contain many bugs.

If you really need to use Intel compilers, recompiling everything with a recent Intel 12.1 installation should solve the issue. I'm currently using it on some machines without any major issue.

Please also remember that "more recent" doesn't always mean "better". New major versions of any software usually have teething problems, and you should use them only if you are willing to contribute to their debugging.
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain

Re: abinit6.12.3 crashes when using several computing nodes

Post by ivasan » Tue Oct 09, 2012 2:12 pm

Dear Yann,

Thanks for your fast response. I will try to use the version of the Intel compilers you mention.

I will let you know how it goes.

Thanks!!!

Iván

Re: abinit6.12.3 crashes when using several computing nodes

Post by ivasan » Fri Oct 26, 2012 4:26 pm

Dear Yann,

I have been doing some more tests and somehow I managed to run the job on more than one computing node ... but I am quite surprised by the solution I found.

It seems that the problem was in the input file, not in the configuration of my cluster. The input file of the simulation I want to run has these values for ionmov and optcell:

ionmov=2
optcell=2

With these values abinit only runs on one computing node (in both 6.10.3 and 6.12.3). However, if I change optcell to

ionmov=2
optcell=0

it runs perfectly on more than one computing node.

Is this behavior normal? I couldn't find any related warning in the manual about the limitations of optcell.

Best regards,

Iván

Re: [SOLVED] abinit6.12.3 crashes in more than 1 node (Intel

Post by ivasan » Fri Oct 26, 2012 5:59 pm

Dear Yann,

I managed to solve the problem. It seems that it was related to the stack.

The solution was to add

FCFLAGS_EXTRA=" -heap-arrays" CFLAGS_EXTRA="-heap-arrays"


to the configure command, which is now:

./configure --prefix="/home/ivasan/programas/abinit/abinit-6.12.3" \
  CC="/opt/intel/impi/4.1.0/intel64/bin/mpiicc" \
  CXX="/opt/intel/impi/4.1.0/intel64/bin/mpiicpc" \
  FC="/opt/intel/impi/4.1.0/intel64/bin/mpiifort" \
  --enable-mpi --enable-mpi-io \
  MPI_RUNNER="/opt/intel/impi/4.1.0/intel64/bin/mpirun" \
  --with-mpi-libs="-L/opt/intel/impi/4.1.0/intel64/lib -lmpi" \
  --with-mpi-incs="-I/opt/intel/impi/4.1.0/intel64/include" \
  --with-linalg-flavor="mkl+scalapack" \
  --with-linalg-incs="-I/opt/intel/mkl/include/intel64/" \
  --with-fft-flavor="fftw3-mkl" \
  --with-fft-incs="-I/opt/intel/mkl/include/fftw" \
  --with-fft-libs="-L/opt/intel/mkl/interfaces/fftw3xf -lfftw3xf_intel" \
  --enable-gw-dpc \
  --with-linalg-libs="-L/opt/intel/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm" \
  FCFLAGS_EXTRA=" -heap-arrays" CFLAGS_EXTRA="-heap-arrays"


As you can see, I use:
- Intel Compilers XE2013 (V13.0)
- Intel MPI 4.1
- LAPACK, BLACS, SCALAPACK, FFTW from Intel MKL 11.0

As far as I have been able to test, there are no more problems.

However, I am not quite sure about the option "-heap-arrays" in the compilation. I had a bad experience with another program: the jobs consumed all the memory on the computing nodes, and the nodes crashed.

Do you think that this can also happen in abinit?

Best regards,

Iván

Location: Université catholique de Louvain - Belgium

Re: SOLVED-abinit6.12.3 crashes in more than 1 node (Intel C

Post by Alain_Jacques » Sun Oct 28, 2012 6:44 pm

Hi Ivan,

I assume that you solved the problem by using the heap instead of the stack when calling functions; the stack is the default for recent compilers. OK, fine: that's the former behaviour, nothing wrong here.
But using the stack has a few advantages (speed, a sort of garbage collection, a sort of protection against memory leaks, ...), so it's a bit of a pity to revert to the heap before trying to extend the stack size, if your cluster nodes crash because the memory allocated to the stack is exhausted. Did you try to use

ulimit -s unlimited
- to be included in your preferred batch script. Or simply see what

ulimit -a
on one node returns? IMHO, this cures half of the obscure segfault cases on Unix boxes.
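A minimal sketch of how this could look at the top of a batch script. The function name raise_stack_limit and the commented launch line are illustrative assumptions, not part of abinit or of any particular scheduler. The fallback to the hard limit is there because `ulimit -s unlimited` fails when the hard limit is lower.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of abinit): raise the soft stack limit
# as far as the hard limit allows, then report the result.
raise_stack_limit() {
    # Try "unlimited" first; this fails if the hard limit is lower.
    if ! ulimit -S -s unlimited 2>/dev/null; then
        # Fall back to the hard limit, the most an unprivileged user can get.
        ulimit -S -s "$(ulimit -H -s)" 2>/dev/null || true
    fi
    echo "stack soft=$(ulimit -S -s) hard=$(ulimit -H -s)"
}

# Call it directly (not inside $(...)) so the new limit applies to this
# shell and to every process started from it afterwards.
raise_stack_limit

# The MPI launch would follow here; the line below is only a placeholder:
# mpirun -np 12 abinit < abinit.files > simul.log
```

Whether the raised limit reaches the processes on the other nodes depends on how the MPI launcher starts them, so it is worth checking the printed values on each node.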

Kind regards,

Alain

Re: SOLVED-abinit6.12.3 crashes in more than 1 node (Intel C

Post by ivasan » Wed Oct 31, 2012 10:06 am

Hi Alain,

Thanks for your suggestion. I saw it in some other places, but in my case

ulimit -s unlimited

does not work. Before using the -heap-arrays option, I added this command to my .bash_profile and this is what I get with 'ulimit -a' in any of the nodes on my cluster

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1587200
max locked memory       (kbytes, -l) 2000000
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1587200
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

but abinit was still not running in more than one node.
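One thing that may be worth ruling out here: .bash_profile is only read by login shells, so processes started on remote nodes by the MPI launcher do not necessarily inherit a limit set there. A small sketch for checking what a process actually sees; report_stack_limit is a hypothetical helper name, and the commented mpirun line (including the process count) is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the node name and the stack limits seen by
# the process that runs it.
report_stack_limit() {
    echo "$(uname -n): stack soft=$(ulimit -S -s) hard=$(ulimit -H -s)"
}

report_stack_limit

# To see the limits on every node, the same commands could be run under
# the MPI launcher (placeholder process count):
# mpirun -np 24 sh -c 'echo "$(uname -n): stack soft=$(ulimit -S -s) hard=$(ulimit -H -s)"'
```

If the remote ranks report a smaller value than the interactive shell does, the limit is being set in a place the launcher does not read.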

I have experienced this problem with another similar program I am using, and I could solve it after some "soft coding": I had to add this routine:

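#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

/* Raise the stack limit of the calling process.
   The trailing underscore lets Fortran code call it as stacksize(). */
void stacksize_(void)
{
    int res;
    struct rlimit rlim;

    /* Report the limits in force before the change (-1 means unlimited). */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Before: cur=%lld,hard=%lld\n", (long long)rlim.rlim_cur, (long long)rlim.rlim_max);

    /* Request an unlimited soft and hard stack limit. */
    rlim.rlim_cur = RLIM_INFINITY;
    rlim.rlim_max = RLIM_INFINITY;
    res = setrlimit(RLIMIT_STACK, &rlim);

    /* res is 0 on success and -1 on failure; print the resulting limits. */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("After: res=%d,cur=%lld,hard=%lld\n", res, (long long)rlim.rlim_cur, (long long)rlim.rlim_max);
}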

to the compilation, and call the function stacksize() at the very beginning of the main file of the program that had the stack problem. This worked in that case (by the way, in that case the option 'ulimit -s unlimited' didn't work either), so I didn't have to use the option '-heap-arrays' in the compilation.

I tried to find out how to implement this method in abinit ... but I am quite new to abinit and I don't understand its structure well.

I am aware of the 'dangers' of the option '-heap-arrays', but at the moment it is the only solution that I have. If you or anyone else can help with this in order to avoid '-heap-arrays', that would be great.

Best regards,

Iván

Re: abinit6.12.3 crashes in more than 1 node (Intel Comp)

Post by ivasan » Wed Oct 31, 2012 5:40 pm

Hi all,

I am still having the problem. When abinit compiled with "-heap-arrays" is about to finish, it crashes with the following message:

m_wffile.F90:279:COMMENT
   MPI/IO accessing FORTRAN file header: detected record mark length=4
 ioarr: data written to disk file cSi216I3Jxo_TIM1_DEN
 bonds_lgth_angles : about to open file cSi216I3Jxo_TIM1_GEO
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)


It seems to be a memory access problem (signal 7 is SIGBUS).

Best regards,

Iván

Re: abinit6.12.3 crashes in more than 1 node (Intel Comp)

Post by ivasan » Wed Oct 31, 2012 7:20 pm

Hello again,

Definitely, "-heap-arrays" is not a solution: I still get errors.

Coming back to the compilation without this option:

./configure --prefix="/home/ivasan/programas/abinit/abinit-6.12.3-stack" \
  CC="/opt/intel/impi/4.1.0/intel64/bin/mpiicc" \
  CXX="/opt/intel/impi/4.1.0/intel64/bin/mpiicpc" \
  FC="/opt/intel/impi/4.1.0/intel64/bin/mpiifort" \
  --enable-mpi --enable-mpi-io \
  MPI_RUNNER="/opt/intel/impi/4.1.0/intel64/bin/mpirun" \
  --with-mpi-libs="-L/opt/intel/impi/4.1.0/intel64/lib -lmpi" \
  --with-mpi-incs="-I/opt/intel/impi/4.1.0/intel64/include" \
  --with-linalg-flavor="mkl+scalapack" \
  --with-linalg-incs="-I/opt/intel/mkl/include/intel64/" \
  --with-fft-flavor="fftw3-mkl" \
  --with-fft-incs="-I/opt/intel/mkl/include/fftw" \
  --with-fft-libs="-L/opt/intel/mkl/interfaces/fftw3xf -lfftw3xf_intel" \
  --enable-gw-dpc \
  --with-linalg-libs="-L/opt/intel/mkl/lib/intel64 -lmkl_blas95_lp64 -lmkl_lapack95_lp64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm"


I have tried to run a very simple simulation (file abinit.2.in) of a supercell relaxation using 12 cores in two different nodes, but it crashes (I am using the LDA psp 14-Si.LDA.fhi). The very same simulation ends without problems when only one node is used.

I have attached two different log files with some MPI information:
- simul.2.log: obtained with the mpirun options -v -genv I_MPI_DEBUG 100
- simul.2.checkmpi.log: obtained with mpirun options -check-mpi -v -genv I_MPI_DEBUG 100.

With respect to the STATUS files after the simulation, some of them contain this information:

Status file, with repetition rate  49, status number    10
 
  Level abinit         : call macroin2

while others contain this:

 Status file, with repetition rate  49, status number    50
 
  Level abinit         : call driver   
  Level driver         : call gstateimg
  Level gstateimg      : enter         
  Level gstate         : call mover   
  Level scfcv          : call vtorho   
  istep      =    1
  Level vtorho(tf)     : loop ikpt     
  isppol     =    1
  ikpt       =    6


Now the errors come from the processes on the slave node, and they point to the gstate routine:

[6] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[6] ERROR:    Fatal signal 11 (SIGSEGV) raised.
[6] ERROR:    Signal was encountered at:
[6] ERROR:       gstate_ (/home/ivasan/programas/abinit/abinit-6.12.3-stack/src/95_drive/gstate.F90:1041)
[6] ERROR:    After leaving:
[6] ERROR:       mpi_comm_rank_(comm=MPI_COMM_WORLD, *rank=0x7fff1a744870->6, *ierr=0x7fff1a744874->MPI_SUCCESS)


Does anyone see anything wrong?

Best regards,

Iván
Attachments:
- abinit.2.in: input file
- simul.2.log: log file from -v -genv I_MPI_DEBUG 100
- simul.2.checkmpi.log: log file from -check-mpi -v -genv I_MPI_DEBUG 100
