Segmentation faults in parallel version of abinit
Posted: Tue Apr 12, 2011 12:19 am
Dear all
I am facing several issues with the parallel version of abinit 6.1 and 6.2. When I use k-point parallelisation only (paral_kgb=0), abinit crashes with a segmentation fault whenever the number of CPUs is not a multiple of nkpt*nsppol (abinit reports this as an inefficient parallelisation).
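For anyone trying to reproduce this, the condition described above can be checked before submitting a job. This is only a minimal sketch: the function name and the example values are hypothetical and are not taken from abinit or from my input files. It tests the rule exactly as stated above (CPU count being a multiple of nkpt*nsppol), not abinit's own internal check.

```python
def is_multiple_of_kpt_spin(nproc: int, nkpt: int, nsppol: int) -> bool:
    """Return True when nproc is a multiple of nkpt*nsppol.

    Hypothetical helper: it mirrors the condition described in the post,
    not any routine inside abinit.
    """
    if nproc < 1 or nkpt < 1 or nsppol < 1:
        raise ValueError("nproc, nkpt and nsppol must all be positive")
    return nproc % (nkpt * nsppol) == 0


# Example values are assumptions chosen for illustration only.
nkpt, nsppol = 8, 1
for nproc in (8, 12, 16, 36):
    status = "multiple" if is_multiple_of_kpt_spin(nproc, nkpt, nsppol) else "not a multiple"
    print(f"nproc={nproc}: {status} of nkpt*nsppol={nkpt * nsppol}")
```

With these example values, 8 and 16 CPUs satisfy the condition, while 12 and 36 fall into the case that crashes for me.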
The segmentation fault error is as follows:
Biggest array : cg(disk), with 248.1758 MBytes.
-P-0000 leave_test : synchronization done...
memana : allocated an array of 248.176 Mbytes, for testing purposes.
memana : allocated 652.902 Mbytes, for testing purposes.
The job will continue.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 35 in communicator MPI_COMM_WORLD
with errorcode 14.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 35 with PID 15321 on
node cja614 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libc.so.6 00002B7FCE839212 Unknown Unknown Unknown
libmpi.so.0 00002B7FCC13ACE5 Unknown Unknown Unknown
libpthread.so.0 00002B7FCE55673D Unknown Unknown Unknown
libc.so.6 00002B7FCE83FF6D Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B9E9F4CF32F Unknown Unknown Unknown
libifcore.so.5 00002B9E9E9CE673 Unknown Unknown Unknown
libifcore.so.5 00002B9E9E95385F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002B9E9F6FA994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B211AA0632F Unknown Unknown Unknown
libifcore.so.5 00002B2119F05673 Unknown Unknown Unknown
libifcore.so.5 00002B2119E8A85F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002B211AC31994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002AE3655D632F Unknown Unknown Unknown
libifcore.so.5 00002AE364AD5673 Unknown Unknown Unknown
libifcore.so.5 00002AE364A5A85F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002AE365801994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002AD826E3D32F Unknown Unknown Unknown
libifcore.so.5 00002AD82633C673 Unknown Unknown Unknown
libifcore.so.5 00002AD8262C185F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002AD827068994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B1A3DFDC32F Unknown Unknown Unknown
libifcore.so.5 00002B1A3D4DB673 Unknown Unknown Unknown
libifcore.so.5 00002B1A3D46085F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002B1A3E207994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B8B96A3A32F Unknown Unknown Unknown
libifcore.so.5 00002B8B95F39673 Unknown Unknown Unknown
libifcore.so.5 00002B8B95EBE85F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002B8B96C65994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002ADA5BC6C32F Unknown Unknown Unknown
libifcore.so.5 00002ADA5B16B673 Unknown Unknown Unknown
libifcore.so.5 00002ADA5B0F085F Unknown Unknown Unknown
abinit-6.6.2 00000000014F44F6 Unknown Unknown Unknown
abinit-6.6.2 0000000000F49462 Unknown Unknown Unknown
abinit-6.6.2 00000000004057C1 Unknown Unknown Unknown
abinit-6.6.2 000000000040510C Unknown Unknown Unknown
libc.so.6 00002ADA5BE97994 Unknown Unknown Unknown
abinit-6.6.2 0000000000405019 Unknown Unknown Unknown
[cja557.localdomain:18525] 4 more processes have sent help message help-mpi-api.txt / mpi-abort
[cja557.localdomain:18525] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Then, if I set paral_kgb to 1, abinit crashes with a different segmentation fault, whatever the number of CPUs allocated to k-points. The only difference is that abinit starts the first SCF step but crashes during the LOBPCG procedure.
I don't have much information about this. Could it be related to my MPI implementation? I compiled abinit with --with-linalg-flavor=mkl, but the crash also occurs with the linalg plugin that can be downloaded from the abinit website.
Thank you very much for your help
Boris