Segmentation faults in parallel version of abinit

Boris
Posts: 128
Joined: Tue Feb 16, 2010 10:13 am
Location: France

Segmentation faults in parallel version of abinit

Post by Boris » Tue Apr 12, 2011 12:19 am

Dear all

I am facing several issues with the parallel version of abinit 6.1 and 6.2. When I use kpoint parallelisation only (paral_kgb=0), abinit crashes with a segmentation fault each time I use a number of cpus that is not a multiple of nkpt*nsppol (abinit denotes this as an inefficient parallelisation).

The segmentation fault error is as follows:

 Biggest array : cg(disk), with    248.1758 MBytes.
-P-0000  leave_test : synchronization done...
 memana : allocated an array of    248.176 Mbytes, for testing purposes.
 memana : allocated     652.902 Mbytes, for testing purposes.
 The job will continue.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 35 in communicator MPI_COMM_WORLD
with errorcode 14.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 35 with PID 15321 on
node cja614 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libc.so.6          00002B7FCE839212  Unknown               Unknown  Unknown
libmpi.so.0        00002B7FCC13ACE5  Unknown               Unknown  Unknown
libpthread.so.0    00002B7FCE55673D  Unknown               Unknown  Unknown
libc.so.6          00002B7FCE83FF6D  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002B9E9F4CF32F  Unknown               Unknown  Unknown
libifcore.so.5     00002B9E9E9CE673  Unknown               Unknown  Unknown
libifcore.so.5     00002B9E9E95385F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002B9E9F6FA994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002B211AA0632F  Unknown               Unknown  Unknown
libifcore.so.5     00002B2119F05673  Unknown               Unknown  Unknown
libifcore.so.5     00002B2119E8A85F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002B211AC31994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002AE3655D632F  Unknown               Unknown  Unknown
libifcore.so.5     00002AE364AD5673  Unknown               Unknown  Unknown
libifcore.so.5     00002AE364A5A85F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002AE365801994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002AD826E3D32F  Unknown               Unknown  Unknown
libifcore.so.5     00002AD82633C673  Unknown               Unknown  Unknown
libifcore.so.5     00002AD8262C185F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002AD827068994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002B1A3DFDC32F  Unknown               Unknown  Unknown
libifcore.so.5     00002B1A3D4DB673  Unknown               Unknown  Unknown
libifcore.so.5     00002B1A3D46085F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002B1A3E207994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002B8B96A3A32F  Unknown               Unknown  Unknown
libifcore.so.5     00002B8B95F39673  Unknown               Unknown  Unknown
libifcore.so.5     00002B8B95EBE85F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002B8B96C65994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread.so.0    00002ADA5BC6C32F  Unknown               Unknown  Unknown
libifcore.so.5     00002ADA5B16B673  Unknown               Unknown  Unknown
libifcore.so.5     00002ADA5B0F085F  Unknown               Unknown  Unknown
abinit-6.6.2       00000000014F44F6  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000F49462  Unknown               Unknown  Unknown
abinit-6.6.2       00000000004057C1  Unknown               Unknown  Unknown
abinit-6.6.2       000000000040510C  Unknown               Unknown  Unknown
libc.so.6          00002ADA5BE97994  Unknown               Unknown  Unknown
abinit-6.6.2       0000000000405019  Unknown               Unknown  Unknown
[cja557.localdomain:18525] 4 more processes have sent help message help-mpi-api.txt / mpi-abort
[cja557.localdomain:18525] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


Then, if I switch the paral_kgb keyword to 1, abinit crashes with another segmentation fault, whatever the number of cpus allocated to k-points. The only difference is that abinit starts the first SCF step but crashes during the LOBPCG procedure.

I don't have much information about this. Could it be related to my MPI implementation? Also, I compiled abinit using --with-linalg-flavor=mkl, but the crash also occurs with the linalg plugin that can be downloaded from the abinit website.

Thank you very much for your help

Boris
----------------------------------------------------------
Boris Dorado
Atomic Energy Commission
France
----------------------------------------------------------

mverstra
Posts: 655
Joined: Wed Aug 19, 2009 12:01 pm

Re: Segmentation faults in parallel version of abinit

Post by mverstra » Fri Apr 22, 2011 2:35 pm

Hello,

1) please try a more recent version (6.6) - the parallelism has been improved significantly

2) this can happen for certain compilers (in particular ifort), but not systematically, if you use a number of processors with an idle proc: 3 k-points and 4 processors, or 6 k-points and 4 processors (which will be filled as 2 2 2 0)

You should be able to choose a number of procs which is suited to your number of k-points, can't you?
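The "2 2 2 0" filling mentioned above can be illustrated with a minimal Python sketch. It assumes a plain block distribution in which every rank takes up to ceil(ntasks / nproc) tasks, with ntasks = nkpt*nsppol; this reproduces the two examples in this post but is not taken from the abinit source, and the function names kpoint_blocks and idle_ranks are hypothetical.

```python
import math


def kpoint_blocks(nkpt, nsppol, nproc):
    """Return how many (k-point, spin) tasks each MPI rank receives.

    Assumption: simple block distribution, each rank takes up to
    ceil(ntasks / nproc) tasks until none are left. This is an
    illustration, not abinit's actual distribution routine.
    """
    ntasks = nkpt * nsppol
    per_rank = math.ceil(ntasks / nproc)
    counts = []
    remaining = ntasks
    for _ in range(nproc):
        take = min(per_rank, remaining)
        counts.append(take)
        remaining -= take
    return counts


def idle_ranks(nkpt, nsppol, nproc):
    """Return the ranks that receive no task under the assumed distribution."""
    counts = kpoint_blocks(nkpt, nsppol, nproc)
    return [rank for rank, n in enumerate(counts) if n == 0]
```

With these definitions, kpoint_blocks(6, 1, 4) gives [2, 2, 2, 0] and idle_ranks(6, 1, 4) gives [3], i.e. the last processor is the idle one described above. A processor count for which idle_ranks returns an empty list avoids the situation entirely.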

Matthieu
Matthieu Verstraete
University of Liege, Belgium

Boris
Posts: 128
Joined: Tue Feb 16, 2010 10:13 am
Location: France

Re: Segmentation faults in parallel version of abinit

Post by Boris » Fri Apr 22, 2011 4:28 pm

Hello Matthieu,

When I wrote that I was using the 6.1 and 6.2 versions of abinit, I actually meant 6.6.1 and 6.6.2 :) Sorry about that.

I compiled abinit using mpif90. And I cannot always choose a number of procs suited to the number of kpoints because all nodes of the computer have 8 cores, so to be efficient the number of kpoints has to be a multiple of 8, right?

Thank you

Boris
----------------------------------------------------------
Boris Dorado
Atomic Energy Commission
France
----------------------------------------------------------

hicpalm
Posts: 44
Joined: Tue Feb 09, 2010 4:33 pm

Re: Segmentation faults in parallel version of abinit

Post by hicpalm » Fri Apr 22, 2011 7:11 pm

Hi,
Boris wrote: the number of kpoints has to be a multiple of 8

I think with mpirun you can specify the number of processors to be used. Even if you have 8 cores, you are not constrained to use all of them at once. Something like mpirun -np 3 will run with only 3 procs.

Boris
Posts: 128
Joined: Tue Feb 16, 2010 10:13 am
Location: France

Re: Segmentation faults in parallel version of abinit

Post by Boris » Fri Apr 22, 2011 8:14 pm

hicpalm wrote: Hi,
Boris wrote: the number of kpoints has to be a multiple of 8

I think with mpirun you can specify the number of processors to be used. Even if you have 8 cores, you are not constrained to use all of them at once. Something like mpirun -np 3 will run with only 3 procs.


I thought that once a node was allocated, all of its cores had to be used at the same time.

Thanks for that

Boris
----------------------------------------------------------
Boris Dorado
Atomic Energy Commission
France
----------------------------------------------------------

mverstra
Posts: 655
Joined: Wed Aug 19, 2009 12:01 pm

Re: Segmentation faults in parallel version of abinit

Post by mverstra » Thu Apr 28, 2011 5:12 pm

It depends on the MPI implementation. In general you add the -np #proc argument yourself, and then you can choose whatever number you like. With slurm, for instance, once you have submitted a script for 47 processors you just run srun and it launches with those processors. Normally you can specify any number (not just multiples of 8) and the rest should work. Don't know about your machine...
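As a rough illustration of picking the -np value by hand, the Python sketch below chooses the largest process count that fits on the available cores and divides nkpt*nsppol evenly, so that no processor is left idle. The helper names best_nproc and mpirun_command are hypothetical, and the executable name "abinit" is an assumption; adapt both to your own machine and launcher.

```python
def best_nproc(nkpt, nsppol, max_cores):
    """Largest process count <= max_cores that divides nkpt*nsppol evenly.

    Assumption: only k-point/spin parallelism is used (paral_kgb=0),
    so an even division means every process gets the same workload.
    """
    if nkpt < 1 or nsppol < 1 or max_cores < 1:
        raise ValueError("nkpt, nsppol and max_cores must all be >= 1")
    ntasks = nkpt * nsppol
    for n in range(min(max_cores, ntasks), 0, -1):
        if ntasks % n == 0:
            return n
    return 1  # not reached: n = 1 always divides ntasks


def mpirun_command(nproc, executable="abinit"):
    """Build an 'mpirun -np N <executable>' argument list.

    The executable name is an assumption; input and output redirection
    are left to the calling job script.
    """
    return ["mpirun", "-np", str(nproc), executable]
```

For example, with 12 k-points, nsppol=1 and one 8-core node, best_nproc(12, 1, 8) returns 6, and mpirun_command(6) yields the arguments for mpirun -np 6 abinit. Two cores stay unused, but no MPI process is started without work.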

Matthieu
Matthieu Verstraete
University of Liege, Belgium

hhwj340
Posts: 20
Joined: Mon Jan 04, 2010 10:42 am

Re: Segmentation faults in parallel version of abinit

Post by hhwj340 » Sun Jun 05, 2011 3:45 pm

I have run into this problem as well.
I have published a post that gives detailed information about my case.

Hi, Boris! Have you solved this problem? Could you tell me how you dealt with it?

Dear Matthieu, I tried your suggestion, but it doesn't work.
I also use ifort to compile abinit. The version of ifort is 11.1.
