autoparal file writing gives MPI ABORT (cluster install)
Posted: Thu Jul 04, 2019 5:22 am
I compiled abinit-8.10.2 with MPI and MPI-IO on one of the XSEDE clusters (Comet, to be specific).
I used spack for the compilation, and the install is in my local directory (not a system-wide install).
The package appears to be working fine in serial mode.
However, when I have a multi-dataset run (ndtset greater than 1) with autoparal 1, the first dataset finishes its calculation successfully and then tries to write its data files (e.g., whatever_xo_DS1_DEN). At this point, it gives an MPI ABORT. There are some cryptic messages, one of which is 'whatever_xo_DS1_DEN file does not exist'.
My understanding is that MPI execution itself is fine, but something is wrong with MPI-IO.
This error also occurs with the system-wide abinit install (an old version installed by the cluster admins), so my guess is that my own compile and install are not at fault.
Am I failing to load some crucial MPI-IO related module?
For my abinit, I load the following modules before running abinit: gnutools, intel, mvapich2_ib, fftw, libxc, abinit/8.10.2.
For system-wide abinit, I simply do a 'module load abinit' and it just works (serial part at least).
Any help is highly appreciated, especially from folks who are familiar with XSEDE clusters in general, and the Comet cluster in particular.