abinit7.0.4 segmentation fault - compiled with intel11.1 mpi

option, parallelism,...

Moderators: fgoudreault, mcote

Forum rules
Please have a look at ~abinit/doc/config/build-config.ac in the source package for detailed and up-to-date information about the configuration of Abinit 8 builds.
For a video explanation on how to build Abinit 7.x for Linux, please go to: http://www.youtube.com/watch?v=DppLQ-KQA68.
IMPORTANT: when an answer solves your problem, please check the little green V-like button on its upper-right corner to accept it.
Locked
Mansour
Posts: 2
Joined: Thu Jan 24, 2013 12:45 pm

abinit7.0.4 segmentation fault - compiled with intel11.1 mpi

Post by Mansour » Thu Jan 24, 2013 2:02 pm

Hello
Hello,
This is the first time I am using abinit. I compiled version 7.0.4 on Linux 2.6.32-279.9.1.el6.x86_64 (Red Hat 4.4.6-4) with a simple mpi-prefix option:
../abinit-7.0.4/configure --enable-mpi --with-mpi-prefix=/cluster/mpi/openmpi/1.6.3/intel/
Of course, I loaded the related modules beforehand:
Currently Loaded Modulefiles:
1) intel/11.1046 2) mpi/intel/openmpi/1.6.3

There were some warnings, but the configuration finally finished. After that, the final step was done with make -j4.
Unfortunately, when I try to test it with make tests_in or make tests_min, lots of segmentation faults come out. I also submitted it on the cluster with 8 CPUs on one node (for the tbasepar_1 test case), but the same problem occurred (i.e. a segmentation fault).
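Summarized as a sketch, the build sequence was roughly the following (module names and the MPI prefix are taken from this post; the build directory is hypothetical):

```shell
# Sketch of the build steps described above; not a verified recipe.
module load intel/11.1046 mpi/intel/openmpi/1.6.3
mkdir build && cd build
../abinit-7.0.4/configure --enable-mpi \
    --with-mpi-prefix=/cluster/mpi/openmpi/1.6.3/intel/
make -j4
make tests_in    # this is where the segmentation faults appear
```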
The last part of the log file looks like this:

Code:

 pspatm: atomic psp has been read  and splines computed

   2.39408461E+03                                ecore*ucvol(ha*bohr**3)
 symatm: atom number     1 is reached starting at atom
   1  3  2  4  1  3  2  4  1  3  2  4  1  3  2  4  1  3  2  4  1  3  2  4
 symatm: atom number     2 is reached starting at atom
   2  4  1  3  2  4  1  3  4  2  3  1  3  1  4  2  3  1  4  2  4  2  3  1
 symatm: atom number     3 is reached starting at atom
   3  1  4  2  4  2  3  1  2  4  1  3  2  4  1  3  4  2  3  1  3  1  4  2
 symatm: atom number     4 is reached starting at atom
   4  2  3  1  3  1  4  2  3  1  4  2  4  2  3  1  2  4  1  3  2  4  1  3
 newkpt: spin channel isppol2 =     1
-P-0000  wfconv:    40 bands initialized randomly with npw=   675, for ikpt=     1
 newkpt: spin channel isppol2 =     2
 newkpt: loop on k-points done in parallel
-P-0000  leave_test : synchronization done...
 pareigocc : MPI_ALLREDUCE

 setup2: Arith. and geom. avg. npw (full set) are     675.000     675.000
 initro : for itypat=  1, take decay length=      0.8000,
 initro : indeed, coreel=     18.0000, nval=  8 and densty=  0.0000E+00.

================================================================================

 getcut: wavevector=  0.0000  0.0000  0.0000  ngfft=  36  36  36
         ecut(hartree)=     30.000   => boxcut(ratio)=   2.08583

 ewald : nr and ng are    3 and   15

 ITER STEP NUMBER     1
 vtorho : nnsclo_now=  2, note that nnsclo,dbl_nnsclo,istep=  0 0  1
  ****  In vtorho for isppol=           1
  ****  In vtorho for isppol=           2
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 25632 on node node156.cluster.icams.ruhr-uni-bochum.de exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I read other posts, especially http://forum.abinit.org/viewtopic.php?f=3&t=1514&start=25, but I couldn't figure out the problem yet. The OS discussed in that post was different, but I guess that in my case a wrong mixture of different libraries was also chosen.
I attached the config.log file, which includes the required information about my platform and the probably wrong options.

Thanks in advance for your help.
Mansour
Attachments
config.log
(152.84 KiB) Downloaded 376 times

Alain_Jacques
Posts: 279
Joined: Sat Aug 15, 2009 9:34 pm
Location: Université catholique de Louvain - Belgium

Re: abinit7.0.4 segmentation fault - compiled with intel11.1

Post by Alain_Jacques » Thu Jan 24, 2013 3:58 pm

Hi Mansour,

I suspect that your abinit has been linked to a flaky linear algebra library, as the configure Russian roulette seems to have found one apparently suitable to replace the fallback NETLIB. As a consequence the binary crashes at the first call of a dot-product function; you can check this with a run of gdb.
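Such a gdb check could look like this (a sketch; the binary path and the .files name are placeholders, not from this thread):

```shell
# Batch-mode gdb: run abinit on a failing test case and print a backtrace
# at the segfault, to see which library routine it dies in.
# Both paths below are hypothetical.
gdb -batch -ex "run < tbasepar_1.files" -ex backtrace /path/to/abinit
```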
Anyway, there are several ways to circumvent this; try one of the following:

* configure with --with-linalg-flavor="none", recompile and test with a make tests_v1 - this is the safest choice, but there is a speed penalty;
* or configure with --enable-zdot-bugfix="yes", recompile and re-check (this may use your enhanced lib, but wannier90 is probably broken);
* or use the MKL libraries for BLAS, LAPACK and FFT - probably the most efficient choice.
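The first two options could be sketched as follows, reusing the MPI prefix from the original post (a sketch, not verified on this cluster):

```shell
# Option 1 (safest): build against the reference NETLIB routines.
../abinit-7.0.4/configure --enable-mpi \
    --with-mpi-prefix=/cluster/mpi/openmpi/1.6.3/intel/ \
    --with-linalg-flavor="none"

# Option 2: keep the detected library, but work around the zdot ABI mismatch.
../abinit-7.0.4/configure --enable-mpi \
    --with-mpi-prefix=/cluster/mpi/openmpi/1.6.3/intel/ \
    --enable-zdot-bugfix="yes"
```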

Kind regards,

Alain

Mansour
Posts: 2
Joined: Thu Jan 24, 2013 12:45 pm

Re: abinit7.0.4 segmentation fault - compiled with intel11.1

Post by Mansour » Fri Jan 25, 2013 12:05 pm

Dear Alain

Thank you very much for your quick reply. Yes, it works. In fact, with that linalg-flavor flag it correctly falls back to NETLIB for the linear algebra package:
../abinit-7.0.4/configure --enable-mpi=yes --enable-optim=yes --with-mpi-prefix=/cluster/mpi/openmpi/1.6.3/intel/ --with-linalg-flavor=none


But I wonder whether this version is proper for MPI calculations or not. As I saw during the make command, the sequential version of the linear algebra package was chosen. Does that really matter for parallel calculations? (The new config.log is attached.)

I also ran tests_v1 successfully. The first tutorial test case tbasepar_1 was also fine with different numbers of CPUs, but I was confused by the results for the second test case, tbasepar_2, which is a spin calculation. In fact, its results depend on the number of CPUs! I checked with different numbers of CPUs (1, 8 and 16); the first result was quite the same as the reference output file, but for the two other cases the calculated energy and stress tensor components deviate significantly from the reference values (around 10 times larger, with sign changes). The results for 8 and 16 CPUs are the same, though. The question is: does treating this case with a larger number of CPUs cause the problem (because it has too few k-points), or is there a mistake related to how the code was compiled?

The comparison with the sdiff command between the 8-CPU and reference output files is shown here (both are also attached):

Code:

-------------------------------------------------------------   -------------------------------------------------------------
 Components of total free energy (in Hartree) :                  Components of total free energy (in Hartree) :

    Kinetic energy  =  1.57704767360493E+02                   |     Kinetic energy  =  1.25650794248530E+02
    Hartree energy  =  1.82744721367767E+01                   |     Hartree energy  =  1.44063774642969E+01
    XC energy       = -2.01317375494705E+01                   |     XC energy       = -1.70626672615728E+01
    Ewald energy    = -8.36499227815283E+01                         Ewald energy    = -8.36499227815283E+01
    PspCore energy  =  6.97983851565236E+00                         PspCore energy  =  6.97983851565236E+00
    Loc. psp. energy= -5.37285622189003E+01                   |     Loc. psp. energy= -4.83382491267464E+01
    NL   psp  energy= -1.00332543896596E+02                   |     NL   psp  energy= -7.79796526379097E+01
    >>>>> Internal E= -7.48836884335736E+01                   |     >>>>> Internal E= -7.99934815792781E+01

    -kT*entropy     = -2.32062757842293E-02                   |     -kT*entropy     = -9.13291316865402E-02
    >>>>>>>>> Etotal= -7.49068947093578E+01                   |     >>>>>>>>> Etotal= -8.00848107109647E+01

 Other information on the energy :                               Other information on the energy :
    Total energy(eV)= -2.03832026569694E+03 ; Band energy (Ha |     Total energy(eV)= -2.17921852561150E+03 ; Band energy (Ha
-------------------------------------------------------------   -------------------------------------------------------------

 Cartesian components of stress tensor (hartree/bohr^3)          Cartesian components of stress tensor (hartree/bohr^3)
  sigma(1 1)= -9.13362800E-03  sigma(3 2)=  0.00000000E+00    |   sigma(1 1)=  2.37971568E-02  sigma(3 2)=  0.00000000E+00
  sigma(2 2)= -9.13362800E-03  sigma(3 1)=  0.00000000E+00    |   sigma(2 2)=  2.37971568E-02  sigma(3 1)=  0.00000000E+00
  sigma(3 3)= -9.13362800E-03  sigma(2 1)=  0.00000000E+00    |   sigma(3 3)=  2.37971568E-02  sigma(2 1)=  0.00000000E+00
                                                              |
-Cartesian components of stress tensor (GPa)         [Pressur | -Cartesian components of stress tensor (GPa)         [Pressur
- sigma(1 1)= -2.68720568E+02  sigma(3 2)=  0.00000000E+00    | - sigma(1 1)=  7.00136408E+02  sigma(3 2)=  0.00000000E+00
- sigma(2 2)= -2.68720568E+02  sigma(3 1)=  0.00000000E+00    | - sigma(2 2)=  7.00136408E+02  sigma(3 1)=  0.00000000E+00
- sigma(3 3)= -2.68720568E+02  sigma(2 1)=  0.00000000E+00    | - sigma(3 3)=  7.00136408E+02  sigma(2 1)=  0.00000000E+00

== END DATASET(S) ===========================================   == END DATASET(S) ===========================================
=============================================================   =============================================================

 -outvars: echo values of variables after computation  ------    -outvars: echo values of variables after computation  ------
        accesswff           1                                 <
            acell      7.0000000000E+00  7.0000000000E+00  7.               acell      7.0000000000E+00  7.0000000000E+00  7.
              amu      5.58470000E+01                                         amu      5.58470000E+01
           bandpp           2                                 <
             ecut      3.00000000E+01 Hartree                                ecut      3.00000000E+01 Hartree
           etotal     -7.4906894709E+01                       |            etotal     -8.0084810711E+01
            fcart     -1.4955468720E-01 -1.4955468720E-01 -1. |             fcart     -1.4955719002E-02 -1.4955719002E-02 -1.
                       1.4955468720E-01  1.4955468720E-01 -1. |                        1.4955719002E-02  1.4955719002E-02 -1.
                       1.4955468720E-01 -1.4955468720E-01  1. |                        1.4955719002E-02 -1.4955719002E-02  1.
                      -1.4955468720E-01  1.4955468720E-01  1. |                       -1.4955719002E-02  1.4955719002E-02  1.


Regarding your comment to use the MKL libraries: I checked and found the proper BLAS and LAPACK libraries for Intel, but not for FFTW; instead, fftw 3.2.2-intel was installed separately.
Could I use BLAS and LAPACK from MKL together with this fftw? Also, would that be better (more efficient) than the current NETLIB I have?

And a final question: should I use the same procedure for ABINIT 7.0.5?

Best Regards

Mansour
Attachments
config.log
new config.log
(143.48 KiB) Downloaded 318 times
tbasepar_2.out
with 8-cpu calculated
(30.28 KiB) Downloaded 327 times

Alain_Jacques
Posts: 279
Joined: Sat Aug 15, 2009 9:34 pm
Location: Université catholique de Louvain - Belgium

Re: abinit7.0.4 segmentation fault - compiled with intel11.1

Post by Alain_Jacques » Fri Jan 25, 2013 6:43 pm

Dear Mansour,

When I perform parallel calculations using MPI, I generally switch to sequential linear algebra in order to avoid CPU overloading. With multithreaded BLAS/LAPACK, if you start a job on a 4-core node by specifying mpirun -np 4 ... and each of these MPI slots opens concurrent threads, you end up with more than one task per core, which is inefficient due to context switching. It is difficult to control two different parallelization technologies that ignore each other. So if an MPI program has been compiled with multithreaded MKL libs, I use the info on http://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications to switch to sequential MKL, i.e. export MKL_NUM_THREADS=1. If you are cautious, when treating a problem with poor MPI parallelization but a large amount of data that may benefit from faster BLAS/LAPACK, you may keep multithreading, but you have to be sure that (mpirun -np) x (MKL_NUM_THREADS) <= allocated cores.
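That last constraint can be sketched with simple shell arithmetic (all numbers below are illustrative, not from this thread):

```shell
# Guard against CPU oversubscription: MPI ranks times MKL threads must
# not exceed the cores allocated on the node. Numbers are illustrative.
CORES=16                   # cores allocated to the job
NP=4                       # value passed to mpirun -np
export MKL_NUM_THREADS=4   # threads each MPI rank's MKL may spawn
if [ $((NP * MKL_NUM_THREADS)) -le "$CORES" ]; then
    echo "ok: ${NP} ranks x ${MKL_NUM_THREADS} MKL threads fit on ${CORES} cores"
else
    echo "oversubscribed: reduce -np or MKL_NUM_THREADS"
fi
```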

Now concerning the difference between 1, 2, 8, ... 16 CPUs on tbasepar_2: you have to look at converged results. I see nstep=5 in the input file ... no way to reach convergence with such a small number of SCF loops. Retry with nstep=200 (maybe even more) and compare the results obtained with a sequential run and a parallel run on 2 processes (for spin up/down); they should match, but be sure that both runs are converged. To use more CPUs, also increase ngkpt and again compare converged outputs.
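A sketch of the corresponding changes to the tbasepar_2 input file (the k-point grid values are illustrative, not taken from the thread):

```
nstep 200      # was nstep=5: far too few SCF steps to reach convergence
ngkpt 4 4 4    # denser k-point grid (illustrative), so more CPUs have work
```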

MKL provides BLAS, LAPACK and fast Fourier transforms, and is arguably the most efficient library. As an example, I use the following extra options for a production abinit:

Code:

--enable-debug="no"
--enable-optim="aggressive"
--with-fft-flavor="fftw3"
--with-fft-libs="-L/opt/intel/Compiler/11.1/073/mkl/lib/em64t -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread"
--with-linalg-flavor="mkl"
--with-linalg-libs="-L/opt/intel/Compiler/11.1/073/mkl/lib/em64t -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread"

Adapt the paths to your system, then make and test (yes, the fftw3 flavor selects the MKL FFT). If you experience the same segfaults, add --enable-zdot-bugfix="yes" and disable wannier90.

Kind regards,

Alain
