This job runs fine with k-point parallelization alone, but it crashes with every other parallelization scheme I have tried. I used
autoparal 1
paral_kgb 1
to find recommended sets of parameters, and all of them crash, both with the latest version (8.6.1) and with earlier versions.
This happens with the latest Intel compilers and MKL (2018)
as well as with earlier Intel versions.
A sample traceback from:
autoparal 0
paral_kgb 1
npband 4
bandpp 1
npkpt 32
npfft 1
getcut: wavevector= 0.0000 0.0000 0.0000 ngfft= 80 120 90
ecut(hartree)= 50.000 => boxcut(ratio)= 1.96423
vtorho : nnsclo_now=2, note that nnsclo,dbl_nnsclo,istep=0 0 1
You should try to get npband*bandpp= 96
For information matrix size is 58376
[48] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[48] ERROR: Fatal signal 11 (SIGSEGV) raised.
[48] ERROR: Signal was encountered at:
[48] ERROR: <<no file/line information found>>
[48] ERROR: After leaving:
[48] ERROR: mpi_alltoallv_(*sendbuf=0x2b3f5b191b00, *sendcounts=0x2b3f4ecc9aa0, *sdispls=0x2b3f4ecc9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b3f5ab4d4c0, *recvcounts=0x2b3f4ecc9a20, *rdispls=0x2b3f4ecc9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7ffc076ebbb0->MPI_SUCCESS)
[51] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[51] ERROR: Fatal signal 11 (SIGSEGV) raised.
[51] ERROR: Signal was encountered at:
[51] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[51] ERROR: <<1 stack level with no information>>
[51] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[51] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/
[51] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.:
[51] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[51] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[51] ERROR: After leaving:
[51] ERROR: mpi_alltoallv_(*sendbuf=0x2b8630f75b00, *sendcounts=0x2b86247ddaa0, *sdispls=0x2b86247dda80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b8630932000, *recvcounts=0x2b86247dda20, *rdispls=0x2b86247dda40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [48:51], *ierr=0x7fffd4965bb0->MPI_SUCCESS)
[63] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[63] ERROR: Fatal signal 11 (SIGSEGV) raised.
[63] ERROR: Signal was encountered at:
[63] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[63] ERROR: <<1 stack level with no information>>
[63] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[63] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/
[63] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/
[63] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[63] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[63] ERROR: After leaving:
[63] ERROR: mpi_alltoallv_(*sendbuf=0x2ab417862100, *sendcounts=0x2ab40b555aa0, *sdispls=0x2ab40b555a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2ab417224100, *recvcounts=0x2ab40b555a20, *rdispls=0x2ab40b555a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [60:63], *ierr=0x7ffd685fbd30->MPI_SUCCESS)
[0] WARNING: starting premature shutdown
[112] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[112] ERROR: Fatal signal 11 (SIGSEGV) raised.
[112] ERROR: Signal was encountered at:
[112] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[112] ERROR: <<1 stack level with no information>>
[112] ERROR: __kmp_external_scalable_free (/mnt/beegfs/bin/abinit)
[112] ERROR: for_deallocate (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/
[112] ERROR: for_dealloc_all_nocheck (/mnt/beegfs/intel/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin/
[112] ERROR: m_lobpcgwf_mp_getghc_gsc_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcg2_mp_lobpcg_run_ (/mnt/beegfs/bin/abinit)
[112] ERROR: m_lobpcgwf_mp_lobpcgwf2_ (/mnt/beegfs/bin/abinit)
[112] ERROR: After leaving:
[112] ERROR: mpi_alltoallv_(*sendbuf=0x2b2ab24f0100, *sendcounts=0x2b2aa5ee9aa0, *sdispls=0x2b2aa5ee9a80, sendtype=MPI_DOUBLE_PRECISION, *recvbuf=0x2b2ab1eb2100, *recvcounts=0x2b2aa5ee9a20, *rdispls=0x2b2aa5ee9a40, recvtype=MPI_DOUBLE_PRECISION, comm=0xffffffffc4000003 CART_SUB CART_CREATE COMM_WORLD [112:115], *ierr=0x7ffca337fab0->MPI_SUCCESS)
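As a sanity check on the settings used for this run (npkpt 32, npband 4, npfft 1, bandpp 1), the implied process grid can be worked out in a few lines of Python. This is only a sketch, not ABINIT code: the helper name `band_group` is hypothetical, and it assumes that the total number of MPI processes is npkpt * npband * npfft (no spinor or image parallelism) and that each band/FFT group is made of consecutive ranks, which is what the communicator ranges [48:51], [60:63] and [112:115] in the traceback suggest.

```python
# Sketch only: assumes nproc = npkpt * npband * npfft and that each
# band/FFT group holds consecutive MPI ranks (inferred from the traceback).
npkpt, npband, npfft, bandpp = 32, 4, 1, 1

nproc = npkpt * npband * npfft   # total MPI processes implied by the input
block = npband * bandpp          # value ABINIT compares against its suggestion


def band_group(rank, npband=npband, npfft=npfft):
    """Return (first, last) rank of the group containing `rank` (hypothetical helper)."""
    size = npband * npfft
    first = (rank // size) * size
    return first, first + size - 1


print("MPI processes:", nproc)
print("npband*bandpp:", block, "(the log suggests 96)")
for r in (48, 51, 63, 112):
    print("rank", r, "-> group", band_group(r))
```

Under these assumptions the groups computed for ranks 48, 51, 63 and 112 coincide with the communicator ranges printed in the traceback, so the crashes all occur in communication within a single band group.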
BUG abinit crashes for paral_kgb=1
Re: BUG abinit crashes for paral_kgb=1 *SOLVED*
I recompiled everything with no openmp and it works fine now!
Re: BUG abinit crashes for paral_kgb=1 (*Not Solved*)
I spoke too soon. The self-consistent calculation normally runs with paral_kgb 1 (although even that occasionally dies), but the Berry phase parts (datasets 3 and 4)
always crash with allocation problems with paral_kgb 1. I tried many different sets of processors, autoparal, and manual settings of the parallelization parameters.
It seems to be dying in m_fftcore/kpgsph, but it is hard to get a good traceback because memory is corrupted.
Example errors:
[1] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[1] ERROR: Fatal signal 11 (SIGSEGV) raised.
[1] ERROR: Signal was encountered at:
[1] ERROR: m_fftcore_mp_kpgsph_ (/mnt/beegfs/bin/abinit)
[1] ERROR: After leaving:
[1] ERROR: mpi_comm_rank_(comm=MPI_COMM_WORLD, *rank=0x7ffece8cf6c0->1, *ierr=0x7ffece8cf6c4->MPI_SUCCESS)
IO operation completed. cpu_time: 0.1 [s], walltime: 0.1 [s]
initberry: for direction 1, nkstr = 4, nstr = 16
initberry: for direction 2, nkstr = 4, nstr = 16
initberry: for direction 3, nkstr = 4, nstr = 16
*** Error in `/mnt/beegfs/bin/abinit': malloc(): smallbin double linked list corrupted: 0x000000000c553f10 ***
*** Error in `/mnt/beegfs/bin/abinit': malloc(): smallbin double linked list corrupted: 0x000000000d24bb60 ***
IO operation completed. cpu_time: 0.0 [s], walltime: 0.1 [s]
initberry: for direction 1, nkstr = 4, nstr = 16
initberry: for direction 2, nkstr = 4, nstr = 16
initberry: for direction 3, nkstr = 4, nstr = 16
Relative gap for number of plane waves between process (%): 0.24
Relative gap for number of plane waves between process (%): 0.26
Relative gap for number of plane waves between process (%): 0.16
Relative gap for number of plane waves between process (%): 0.16
*** Error in `/mnt/beegfs/bin/abinit': malloc(): memory corruption: 0x000000000d32cba0 ***
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(1552)...................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0xd2a91a0, count=9216, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1393).............: fail failed
MPIR_Allreduce_intra(975).............: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
MPIR_Allreduce_intra(1040)............: fail failed
MPIC_Sendrecv(576)....................: fail failed
MPIDI_CH3_EagerContigIsend(677).......: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(37)..................: Communication error with rank 9
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(1552)...................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0xc37a900, count=9216, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(1393).............: fail failed
MPIR_Allreduce_intra(975).............: fail failed
MPIDU_Complete_posted_with_error(1710): Process failed
MPIR_Allreduce_intra(1040)............: fail failed
MPIC_Sendrecv(576)....................: fail failed
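Since the output of many ranks is interleaved, it can help to reduce such a log to the failing ranks and the first frame reported for each. The Python sketch below is an illustration only, not a tool that ships with ABINIT or Intel MPI; the function name `summarize` and the regular expressions are assumptions based solely on the line formats shown in this thread.

```python
import re

# Line formats assumed from the logs in this thread.
RANK_LINE = re.compile(r"^\[(\d+)\] ERROR: (.*)$")
GLIBC_LINE = re.compile(r"^\*\*\* Error in `[^']*': (.*?): 0x[0-9a-f]+ \*\*\*$")


def summarize(log_text):
    """Return ({rank: first frame after 'Signal was encountered at:'}, [glibc messages])."""
    frames = {}
    glibc = []
    expect_frame = set()  # ranks whose next ERROR line names the crash site
    for line in log_text.splitlines():
        line = line.strip()
        m = RANK_LINE.match(line)
        if m:
            rank, msg = int(m.group(1)), m.group(2)
            if msg.startswith("Signal was encountered at:"):
                expect_frame.add(rank)
            elif rank in expect_frame:
                frames[rank] = msg.split(" (")[0]  # drop the "(binary path)" suffix
                expect_frame.discard(rank)
            continue
        g = GLIBC_LINE.match(line)
        if g:
            glibc.append(g.group(1))
    return frames, glibc


# A few lines copied from the tracebacks above.
sample = """\
[51] ERROR: Fatal signal 11 (SIGSEGV) raised.
[51] ERROR: Signal was encountered at:
[51] ERROR: __kmp_external__ZN3rml8internal5Block16freePublicObjectEPNS0_10FreeObjectE (/mnt/beegfs/bin/abinit)
[1] ERROR: Signal was encountered at:
[1] ERROR: m_fftcore_mp_kpgsph_ (/mnt/beegfs/bin/abinit)
*** Error in `/mnt/beegfs/bin/abinit': malloc(): memory corruption: 0x000000000d32cba0 ***
"""

frames, glibc = summarize(sample)
for rank, frame in sorted(frames.items()):
    print(rank, frame)
print(glibc)
```

Grouping by first frame makes it easier to see that the earlier crashes land in the `__kmp_external_` allocator routines while the Berry phase crashes land in `m_fftcore_mp_kpgsph_`.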