BUG abinit crashes for paral_kgb=1
Posted: Fri Nov 10, 2017 11:30 am
This job runs fine with k-point parallelization only, but crashes with every other parallelization setup I have tried. I used
autoparal 1
paral_kgb 1
max_ncpus=128
to find recommended sets of parameters, and they all crash with the latest version, 8.6.1, as well as with earlier versions.
This happens with the latest Intel compilers and MKL (2018) as well as with earlier Intel versions.
A sample traceback from:
autoparal 0
paral_kgb 1
#max_ncpus=128
npband 4
bandpp 1
npkpt 32
npfft 1
gives
getcut: wavevector= 0.0000 0.0000 0.0000 ngfft= 80 120 90
ecut(hartree)= 50.000 => boxcut(ratio)= 1.96423
ITER STEP NUMBER 1
vtorho : nnsclo_now=2, note that nnsclo,dbl_nnsclo,istep=0 0 1
You should try to get npband*bandpp= 96
For information matrix size is 58376
[48] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[48] ERROR: Fatal signal 11 (SIGSEGV) raised.
[48] ERROR: Signal was encountered at:
[48] ERROR: <<no file/line information found>>
[48] ERROR: After leaving:
[48] ERROR: mpi_alltoallv_(*sendbuf=0x2b3f5b191b00, *sendcounts=0x2b3f4ecc9aa0, *sdispls=0x2b3f4ecc9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b3f5ab4d4c0, *recvcounts=0x2b3f4ecc9a20, *rdispls=0x2b3f4ecc9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7ffc076ebbb0->MPI_SUCCESS)
[51] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[51] ERROR: Fatal signal 11 (SIGSEGV) raised.
[51] ERROR: Signal was encountered at:
[51] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[51] ERROR: <<1 stack level with no information>>
[51] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[51] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[51] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[51] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[51] ERROR: After leaving:
[51] ERROR: mpi_alltoallv_(*sendbuf=0x2b8630f75b00, *sendcounts=0x2b86247ddaa0, *sdispls=0x2b86247dda80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b8630932000, *recvcounts=0x2b86247dda20, *rdispls=0x2b86247dda40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7fffd4965bb0->MPI_SUCCESS)
[63] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[63] ERROR: Fatal signal 11 (SIGSEGV) raised.
[63] ERROR: Signal was encountered at:
[63] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[63] ERROR: <<1 stack level with no information>>
[63] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[63] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[63] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[63] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[63] ERROR: After leaving:
[63] ERROR: mpi_alltoallv_(*sendbuf=0x2ab417862100, *sendcounts=0x2ab40b555aa0, *sdispls=0x2ab40b555a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2ab417224100, *recvcounts=0x2ab40b555a20, *rdispls=0x2ab40b555a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [60:63], *ierr=0x7ffd685fbd30->MPI_SUCCESS)
[0] WARNING: starting premature shutdown
[112] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[112] ERROR: Fatal signal 11 (SIGSEGV) raised.
[112] ERROR: Signal was encountered at:
[112] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[112] ERROR: <<1 stack level with no information>>
[112] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[112] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[112] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/libifcoremt.so.5)
[112] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[112] ERROR: After leaving:
[112] ERROR: mpi_alltoallv_(*sendbuf=0x2b2ab24f0100, *sendcounts=0x2b2aa5ee9aa0, *sdispls=0x2b2aa5ee9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b2ab1eb2100, *recvcounts=0x2b2aa5ee9a20, *rdispls=0x2b2aa5ee9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [112:115], *ierr=0x7ffca337fab0->MPI_SUCCESS)
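As a sanity check on the process grid in the sample input above, here is a minimal Python sketch. It assumes that with paral_kgb 1 the number of MPI processes should equal npkpt*npband*npfft (times npspinor, taken as 1 here), and it takes the log hint "You should try to get npband*bandpp= 96" at face value. The function names are hypothetical and are not part of ABINIT.

```python
def check_kgb_grid(npkpt, npband, npfft, nproc, npspinor=1):
    """Return a list of problems with the process grid (empty if none).

    Assumption: the grid npkpt*npband*npfft*npspinor should match the
    number of MPI processes the job is launched with.
    """
    problems = []
    grid = npkpt * npband * npfft * npspinor
    if grid != nproc:
        problems.append(
            f"grid size {grid} does not match the {nproc} MPI processes"
        )
    return problems


def suggested_bandpp(target_product, npband):
    """Return the bandpp that gives npband*bandpp == target_product.

    Returns None when npband does not divide the target evenly.
    Assumption: target_product is the value printed in the log hint.
    """
    if npband <= 0 or target_product % npband != 0:
        return None
    return target_product // npband


if __name__ == "__main__":
    # Values from the sample input: npkpt 32, npband 4, npfft 1 on 128 processes.
    print(check_kgb_grid(npkpt=32, npband=4, npfft=1, nproc=128))
    # Log hint: npband*bandpp = 96, with npband 4.
    print(suggested_bandpp(96, 4))
```

With the values above the grid is consistent with 128 processes, and the hint would correspond to bandpp 24 rather than the bandpp 1 used in the crashing run. Whether that mismatch is related to the segfault is not established here.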