mkl13+mvapich2 1.8.1 segmentation fault

Post by samuel.fux » Wed Dec 10, 2014 9:09 am

Hi,

I compiled Abinit 7.8.1 with Intel 13.1.1 (+MKL) and Open MPI 1.6.5. Serial jobs run fine, but parallel jobs sometimes end in a segmentation fault. Abinit 7.10.1 shows the same problem, and so does switching from Open MPI 1.6.5 to MVAPICH2 1.8.1.

Configuration for intel 13.1.1 and mvapich2 1.8.1 (with increased log level):

./configure CC=mpicc CXX=mpicxx \
  FCFLAGS_EXTRA="-g -O0 -check all -traceback" \
  --prefix=/cluster/apps/abinit/test7.10.1/x86_64 \
  --enable-debug=naughty \
  --enable-openmp \
  --with-wannier90-bins=/cluster/apps/abinit/test7.10.1/wannier90 \
  --with-wannier90-libs="-L/cluster/apps/abinit/test7.10.1/wannier90 -lwannier" \
  --enable-64bit-flags \
  --enable-mpi \
  --enable-fast-check \
  --enable-mpi-io \
  --with-mpi-prefix="$MPI_ROOT" \
  --with-fft-flavor="fftw3-mpi" \
  --with-fft-incs="-I/cluster/apps/mvapich2/1.8.1/x86_64/intel_13.1.1/include" \
  --with-fft-libs="-L/cluster/apps/mvapich2/1.8.1/x86_64/intel_13.1.1/lib64 -lfftw3 -lfftw3_mpi" \
  --with-dft-flavor="wannier90" \
  --with-timer-flavor="abinit" \
  --with-linalg-flavor="mkl+scalapack" \
  --with-linalg-incs="-I/$MKLROOT/include" \
  --with-linalg-libs="-L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_sequential -lmkl_core" \
  --enable-optim CFLAGS_OPTIM="-O1" CXXFLAGS_OPTIM="-O1" FCFLAGS_OPTIM="-O1"

The compilation never produced any errors. Any idea what the problem could be?

Log from the failed run, starting at the point where MPI is set up:

mpi_setup@mpi_setup.F90:111 >>>>> ENTER

initmpi_seq@initmpi_seq.F90:65 >>>>> ENTER

initmpi_seq@initmpi_seq.F90:129 >>>>> EXIT

initmpi_img@initmpi_img.F90:85 >>>>> ENTER

initmpi_img@initmpi_img.F90:348 >>>>> EXIT

initmpi_seq@initmpi_seq.F90:65 >>>>> ENTER

initmpi_seq@initmpi_seq.F90:129 >>>>> EXIT

finddistrproc@finddistrproc.F90:144 >>>>> ENTER

kpgcount@m_fftcore.F90:3717 >>>>> ENTER

kpgcount@m_fftcore.F90:3763 >>>>> EXIT

getmpw sequential formula gave: 70247

Computing all possible proc distrib for this input with nproc less than 4
npimage| npkpt| npspinor| npfft| npband| bandpp | nproc| weight|
1 -> 1| 1 -> 4| 1 -> 1| 1 -> 4| 1 -> 4| 1 -> 12| 2 -> 4| 1 -> 4|
1| 4| 1| 1| 1| 1| 4| 3.91 |
1| 2| 1| 2| 1| 1| 4| 3.55 |
1| 2| 1| 1| 2| 1| 4| 3.55 |
1| 1| 1| 4| 1| 1| 4| 3.31 |
1| 1| 1| 2| 2| 1| 4| 3.31 |

Values below have been tested with respect to Linear Algebra performance;
Weights below are corrected according:
npimage| npkpt| npspinor| npfft| npband| bandpp | nproc| weight|new weight|
compute_kgb_indicator@compute_kgb_indicator.F90:95 >>>>> ENTER

compute_kgb_indicator : (bpp,npb,npf) = 1 1 2
init_scalapack@m_slk.F90:401 >>>>> ENTER

build_grid_scalapack@m_slk.F90:266 >>>>> ENTER

build_grid_scalapack@m_slk.F90:283 >>>>> EXIT

build_processor_scalapack@m_slk.F90:331 >>>>> ENTER

build_processor_scalapack@m_slk.F90:353 >>>>> EXIT

init_scalapack@m_slk.F90:410 >>>>> EXIT

init_matrix_scalapack@m_slk.F90:514 >>>>> ENTER

init_matrix_scalapack@m_slk.F90:574 >>>>> EXIT

init_matrix_scalapack@m_slk.F90:514 >>>>> ENTER

init_matrix_scalapack@m_slk.F90:574 >>>>> EXIT

init_matrix_scalapack@m_slk.F90:514 >>>>> ENTER

init_matrix_scalapack@m_slk.F90:574 >>>>> EXIT

Boundary Run-Time Check Failure for variable 'm_slk_mp_compute_generalized_eigen_problem_$RWORK_TMP'

Boundary Run-Time Check Failure for variable 'm_slk_mp_compute_generalized_eigen_problem_$RWORK_TMP'

forrtl: error (76): Abort trap signal
Image PC Routine Line Source
libc.so.6 00002AC271DFE625 Unknown Unknown Unknown
libc.so.6 00002AC271DFFE05 Unknown Unknown Unknown
libirc.so 00002AC271B89D2F Unknown Unknown Unknown
abinit 000000000F9FF976 m_slk_mp_compute_ 2714 m_slk.F90
abinit 000000000FA05FC6 m_slk_mp_compute_ 2977 m_slk.F90
abinit 000000000F77AB42 m_abi_linalg_mp_a 122 abi_xhegv.f90
abinit 000000000F77C248 m_abi_linalg_mp_a 221 abi_xhegv.f90
abinit 00000000098F8EDE compute_kgb_indic 209 compute_kgb_indicator.F90
abinit 000000000963E2E1 finddistrproc_ 793 finddistrproc.F90
abinit 0000000008FD072C mpi_setup_ 213 mpi_setup.F90
abinit 000000000041043D MAIN__ 330 abinit.F90
abinit 000000000040D38C Unknown Unknown Unknown
libc.so.6 00002AC271DEAD5D Unknown Unknown Unknown
abinit 000000000040D289 Unknown Unknown Unknown
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
libc.so.6 00002B3E3F78A625 Unknown Unknown Unknown
libc.so.6 00002B3E3F78BE05 Unknown Unknown Unknown
libirc.so 00002B3E3F515D2F Unknown Unknown Unknown
abinit 000000000F9FF976 m_slk_mp_compute_ 2714 m_slk.F90
abinit 000000000FA05FC6 m_slk_mp_compute_ 2977 m_slk.F90
abinit 000000000F77AB42 m_abi_linalg_mp_a 122 abi_xhegv.f90
abinit 000000000F77C248 m_abi_linalg_mp_a 221 abi_xhegv.f90
abinit 00000000098F8EDE compute_kgb_indic 209 compute_kgb_indicator.F90
abinit 000000000963E2E1 finddistrproc_ 793 finddistrproc.F90
abinit 0000000008FD072C mpi_setup_ 213 mpi_setup.F90
abinit 000000000041043D MAIN__ 330 abinit.F90
abinit 000000000040D38C Unknown Unknown Unknown
libc.so.6 00002B3E3F776D5D Unknown Unknown Unknown
abinit 000000000040D289 Unknown Unknown Unknown

=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
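
In case it helps when reading the traceback above, here is a small sketch that keeps only the frames resolved inside the abinit binary. It is not part of Abinit: the function name abinit_frames is made up, and it assumes the five-column forrtl layout (Image, PC, Routine, Line, Source) shown in the log. The sample lines are copied from the traceback.

```shell
#!/bin/sh
# Sketch: reduce a forrtl traceback to "source:line routine" for abinit frames.
# Assumes the columns are: Image PC Routine Line Source (as in the log above).
# abinit_frames is a hypothetical helper name, not an Abinit or Intel tool.

abinit_frames() {
    # Keep frames from the abinit image whose line number was resolved.
    awk '$1 == "abinit" && $4 != "Unknown" { print $5 ":" $4 " " $3 }'
}

# A few lines taken from the traceback in this post.
sample='libirc.so 00002AC271B89D2F Unknown Unknown Unknown
abinit 000000000F9FF976 m_slk_mp_compute_ 2714 m_slk.F90
abinit 000000000F77AB42 m_abi_linalg_mp_a 122 abi_xhegv.f90
abinit 000000000040D38C Unknown Unknown Unknown'

printf '%s\n' "$sample" | abinit_frames
```

Applied to the full traceback, the first frame printed is m_slk.F90 line 2714, reached from compute_kgb_indicator.F90 line 209, which matches the bounds-check message about RWORK_TMP.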

Post by Jordan » Wed Dec 10, 2014 5:13 pm

If I were you, I would try with far fewer options:
-> remove wannier90
-> remove fftw3-mpi and use fftw3 from MKL instead
-> remove openmp
-> try without scalapack, just MKL

Once you have something working, add the features above back one by one if you need them.
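
A rough sketch of that procedure, only to illustrate the idea. The flag names are the ones from the configure line in the first post; the function configure_flags, the stage numbers, and the order in which features come back are my own assumptions. The -incs and -libs options, the compiler variables, and the wannier90 paths still have to be supplied as before, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: print the feature flags for a stripped-down build (stage 0) and for
# later stages that re-enable one feature at a time.
# configure_flags and the stage order are hypothetical; the flags themselves
# are taken from the configure line in the original post.

configure_flags() {
    stage=$1
    echo '--enable-mpi'
    # Stage 1 brings ScaLAPACK back; before that, plain MKL only.
    if [ "$stage" -ge 1 ]; then
        echo '--with-linalg-flavor=mkl+scalapack'
    else
        echo '--with-linalg-flavor=mkl'
    fi
    # Stage 2 brings OpenMP back.
    if [ "$stage" -ge 2 ]; then
        echo '--enable-openmp'
    fi
    # Stage 3 brings fftw3-mpi back; before that, the fftw3 flavor
    # (assumed here to be linked against MKL's FFTW3 interface).
    if [ "$stage" -ge 3 ]; then
        echo '--with-fft-flavor=fftw3-mpi'
    else
        echo '--with-fft-flavor=fftw3'
    fi
    # Stage 4 brings wannier90 back.
    if [ "$stage" -ge 4 ]; then
        echo '--with-dft-flavor=wannier90'
    fi
}

# Print one candidate configure command per stage.
for s in 0 1 2 3 4; do
    echo "stage $s: ./configure $(configure_flags "$s" | tr '\n' ' ')"
done
```

Rebuild and rerun the parallel test after each stage; the first stage that fails points at the feature to look into.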

Cheers,

Jordan