
paral_kgb options

Posted: Mon Dec 06, 2010 2:31 am
by natalie
I have very little experience running abinit in parallel mode. I recently installed 6.4.2, checked one of the parallel jobs in the test suite, and also successfully ran a parallel job with bccLi, but the input file for black phosphorus gives the following error:

ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
*** glibc detected *** free(): invalid pointer: 0x0000002ab7b39010 ***
*** glibc detected *** free(): invalid pointer: 0x0000002ab7b39010 ***
*** glibc detected *** free(): invalid pointer: 0x0000002ab7b39010 ***
*** glibc detected *** free(): invalid pointer: 0x0000002ab7b39010 ***
mpiexec: Warning: tasks 0-3 died with signal 6 (Aborted).

I suspect that I have not defined a required parameter. The input file is given below. I will be glad to provide any additional files if needed. I set up the PBS job to run with 4 nodes and the execute command was:

mpiexec abinit <P.files

Thanks in advance for any suggestions,
Natalie Holzwarth
Department of Physics, Wake Forest University, Winston-Salem, NC 27106 USA

---P.in-----

ecut 32.00
pawecutdg 64.

#Structural relaxation
ionmov 2
optcell 2
ecutsm 0.5 Ha
dilatmx 1.8
ntime 100

spgroup 64
brvltt -1
acell 3.3117 10.158 4.243 angstrom



nstep 40
toldfe 1.0d-10
nband 40
occopt 7 tsmear 5.0d-4
#iscf 14

#Definition of the atom types
ntypat 1
znucl 15


#Definition of the atoms
natom 4
natrd 1
typat 1
xred 0.00000 0.10540 0.07470


#Definition of the k-point grid
kptopt 1
ngkpt 4 4 3
nshiftk 1
shiftk 0.5 0.5 0.5

prtwf 0
prtden 0

#parallel
paral_kgb 1
npband 1
npfft 1
npkpt 4
wfoptalg 4
nloalg 4
fftalg 401
intxc 0
fft_opt_lob 2

---------end of P.in------------------

Re: paral_kgb options

Posted: Mon Dec 06, 2010 7:58 pm
by Alain_Jacques
Hello Natalie,

This sounds like a bug: Abinit seems to free memory through an invalid pointer (possibly memory that has already been released). So even if a parameter were wrong or missing (the input looks fine to me), it should not crash with such a memory corruption. Depending on your glibc version, the rest may work or not. I will try to reproduce this behavior on my system, but would you be so kind as to provide some extra debugging information?

Try to "prepend"

Code:

MALLOC_CHECK_=0
to your parallel mpiexec launch and see if Abinit goes further (glibc should ignore heap corruption and let Abinit continue, at the risk of memory leaks). If you cannot ensure that all the parallel slots are located on the same node, add this variable to your parallel environment so that it is propagated to all the running nodes.
If MALLOC_CHECK_ is set to 1 as in

Code:

MALLOC_CHECK_=1 mpiexec ...
glibc should provide more information and let Abinit run up to the end (or segfault). Does it work - any relevant info?
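
As a minimal sketch of what "prepending" means here, assuming a POSIX shell: the helper name `run_with_malloc_check` is hypothetical and not part of Abinit or MPICH2; it only sets MALLOC_CHECK_ in the environment of the one command it launches. Whether mpiexec forwards that environment to remote nodes depends on the MPI setup, as noted above.

```shell
#!/bin/sh
# Hypothetical helper (not an Abinit or MPICH2 tool): run a command with
# MALLOC_CHECK_ set to the given level for that command only.
run_with_malloc_check() {
    level="$1"
    shift
    # env places the variable in the child's environment without
    # changing the calling shell.
    env MALLOC_CHECK_="$level" "$@"
}

# Intended use with the launch command from the first post (not executed here):
#   run_with_malloc_check 1 mpiexec abinit <P.files
```

This is equivalent to typing `MALLOC_CHECK_=1 mpiexec ...` by hand; the wrapper only makes it easy to switch between levels 0 and 1.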

What Linux variant are you running? (glibc version?) What version of Fortran and MPICH2 were used to compile Abinit?

Kind regards,

Alain

Re: paral_kgb options

Posted: Mon Dec 06, 2010 10:40 pm
by natalie
Dear Alain,
I tried to rerun the job with the two different values of MALLOC_CHECK_=0 or MALLOC_CHECK_=1 and the results look identical to me.
I am using Intel 11.1 for Fortran together with MPICH2, and the build linked against the following libraries:
FC_LIBS=" -L/opt/intel111-libs/mpich2-1.0.8p1/lib -lmpichf90 -lmpich -lpthread -lrt -L/system0/opt/intel/Compiler/11.1/072/lib/intel64 -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6 -L/usrlib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64-L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../.. -L/lib64 -L/lib -L/usr/lib64 -L/usr/lib -lifport -lifcore -limf -lsvml -lm -lipgo -lirc -lirc_s -ldl"
This was generated automatically during the ./configure step. As I mentioned, the program does work in some cases, so it cannot be completely wrong. On the other hand, I did notice that the optimization level was rather high (FCFLAGS="-O3 -xW -vec-report0"); perhaps that is a bad idea. Thank you for offering to try to reproduce the error. I will be glad to send you the PAW pseudopotential file if that would be helpful. Thanks, Natalie

Re: paral_kgb options

Posted: Mon Dec 06, 2010 11:38 pm
by Alain_Jacques
Dear Natalie,

O3 should be all right. The routine test procedure uses O2, but I have had no trouble with O3 and previous Abinit releases as far as accuracy and stability are concerned.
Anything that helps me reproduce your conditions is welcome, so I'll gladly use your PAW pseudo. You don't need to upload the file if it is the same as the one in your pwpaw table.

Kind regards,

Alain

Re: paral_kgb options

Posted: Mon Dec 06, 2010 11:48 pm
by natalie
The PAW function should be very similar to the one on the web page, but we changed a few parameters and have not updated the web page since Marc Torrent improved the code. I will upload a bzip2 file. Natalie

Re: paral_kgb options

Posted: Tue Dec 07, 2010 12:11 am
by natalie
I failed to upload the file to the forum. But it is available on the unlinked webpage http://www.wfu.edu/~natalie/papers/pwpa ... abinit.bz2

Re: paral_kgb options

Posted: Tue Dec 07, 2010 9:42 pm
by Alain_Jacques
Hello Natalie,

Thanks for the pseudo. Uploading of large files is probably deactivated.

Just a small question ... if you comment out the paral_kgb, npband, ... parallel options in your P.in i.e. change the last section to

Code:

#parallel
#paral_kgb 1
#npband 1
#npfft 1
#npkpt 4
wfoptalg 4
nloalg 4
fftalg 401
intxc 0
fft_opt_lob 2
and launch with mpiexec -np 4 abinit ..., does it run in parallel without crashing? I think it's paral_kgb 1 that is toxic here.
I admit that I never use it in this context. For example, I use paral_kgb -4 during a sequential Abinit run to output suggestions about parallelism (here on up to 4 slots) and then comment it out during the actual parallel job. But I agree that paral_kgb 1 should work according to the documentation. I'll check further.
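
A minimal sketch of that edit, assuming a POSIX shell and a sed that supports -E: the helper name `comment_out_kgb` and the output file name in the usage comment are hypothetical. It prefixes the four distribution variables named above with `#` and leaves every other line untouched.

```shell
#!/bin/sh
# Hypothetical helper: print an Abinit input file with the paral_kgb-related
# distribution variables (paral_kgb, npband, npfft, npkpt) commented out.
comment_out_kgb() {
    # Match the variable name at the start of a line (after optional blanks),
    # followed by whitespace, and insert '#' in front of the name.
    sed -E 's/^([[:space:]]*)(paral_kgb|npband|npfft|npkpt)([[:space:]])/\1#\2\3/' "$1"
}

# Intended use (file names are illustrative, not executed here):
#   comment_out_kgb P.in > P_nokgb.in
```

Lines that are already commented are not matched, so running the helper twice gives the same result.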

Kind regards,

Alain

Re: paral_kgb options

Posted: Tue Dec 07, 2010 10:07 pm
by natalie
Dear Alain,
It does seem to be running in parallel with your most recent suggestion. It will take a while to finish, but it looks like this is a "fix"? Did I define the variables in the wrong order, or what do you suggest in general? Is nkpt=#processes the default? In any case, Thanks, Natalie

Re: paral_kgb options

Posted: Tue Dec 07, 2010 10:24 pm
by Alain_Jacques
The idea is that if you run a (sequential or parallel) Abinit with paral_kgb -100, it will output

WARNING in invars1m For dataset= 1 a possible choice for less than 100 processors is:

nproc  npkpt  npband  npfft  bandpp  weight
   96     12       4      2       2    0.50
   96     12       8      1       1    1.00
   60     12       5      1       4    1.00
   48     12       2      2       4    0.25
   48     12       4      1       2    1.00
   24     12       2      1       4    1.00
invars1m : launch a parallel version of ABINIT with a number of processor among the above list, and the associated input variables
npkpt, npband, npfft and bandpp. The optimal weight is close to 1.

and then stop. npkpt, npband, ... and -np=nproc are then to be adjusted according to the number of available CPUs and the optimal weight. After that I (luckily) get rid of that variable.
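
One way to read such a table automatically is sketched below, assuming a POSIX shell with awk. The helper name `pick_distribution` and the selection rule (weight closest to 1, ties broken by the larger nproc) are assumptions made for illustration, not something Abinit prescribes. In the rows shown above, nproc equals npkpt × npband × npfft.

```shell
#!/bin/sh
# Hypothetical helper: read "nproc npkpt npband npfft bandpp weight" rows on
# stdin and print the row with nproc <= $1 whose weight is closest to 1.
# Ties are broken in favour of the larger nproc (an assumed rule).
pick_distribution() {
    awk -v max="$1" '
        # Keep only six-column rows that start with an integer (skips the header).
        NF == 6 && $1 ~ /^[0-9]+$/ && $1 + 0 <= max + 0 {
            d = $6 - 1
            if (d < 0) d = -d
            if (!found || d < best_d || (d == best_d && $1 + 0 > best_n)) {
                found = 1; best_d = d; best_n = $1 + 0; best = $0
            }
        }
        END { if (found) print best; else exit 1 }
    '
}

# Intended use (file name is illustrative, not executed here):
#   pick_distribution 64 < table.txt
```

The printed row gives the values to put in the input file for npkpt, npband, npfft and bandpp, and the first column is the process count to pass to mpiexec.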

Alain

Re: paral_kgb options

Posted: Wed Dec 08, 2010 10:11 am
by torrent
Natalie,

Just a little remark:

All these options:
wfoptalg 4
nloalg 4
fftalg 401
intxc 0
fft_opt_lob 2


should be set by default in the future 6.6 version of Abinit.

Also, if you put wfoptalg=14, you get more efficient runs (this will be the default in v6.6).
But this has no link with your initial problem.


Marc

Re: paral_kgb options

Posted: Wed Dec 08, 2010 2:53 pm
by natalie
Thanks Alain and Marc! Your help is very much appreciated. Natalie

Re: paral_kgb options

Posted: Mon Feb 21, 2011 1:29 pm
by bvcpbk
It seems to me that the reason for the observed behaviour is line 314 in 66_wfs/prep_getghc.F90 (I'm referring to the production 6.4.3 code):

allocate(swavef_alltoall_sym(2,(ndatarecv_tot*bandpp_sym)*iscalc))

with iscalc = 0 (set previously at line 186). This yields a zero-sized 2nd dimension, which seems to disturb intel compilers.
Changing this line to

allocate(swavef_alltoall_sym(2,ndatarecv_tot*bandpp_sym))

seems to fix the problem.

Cheers, BK

Re: paral_kgb options

Posted: Mon Feb 21, 2011 5:22 pm
by Alain_Jacques
Dear BK,

Thanks for the debugging. Fixed in the upcoming release.

Alain

Re: paral_kgb options

Posted: Tue Feb 22, 2011 12:21 pm
by torrent
Dear Alain,

(thanks to BK for the debugging)

I'm not sure this correction is the optimal one, because of memory considerations.
At that level of the code, we absolutely have to save memory,
and the proposed code modification introduces an unused array which can have a large size.

Instead of

Code:

allocate(swavef_alltoall_sym(2,(ndatarecv_tot*bandpp_sym)*iscalc))


...I would propose

Code:

if (iscalc>0) then
  allocate(swavef_alltoall_sym(2,ndatarecv_tot*bandpp_sym))
else
  allocate(swavef_alltoall_sym(1,1))
endif


This is certainly not elegant at all... but it saves memory and avoids the use of a zero-sized array.

Do you agree with this ?

A bientôt,
Marc

Re: paral_kgb options

Posted: Tue Feb 22, 2011 2:53 pm
by Alain_Jacques
Hello Marc,

I was glad to see that the 66_wfs/prep_getghc.F90 allocations were already modified in 6.6.1, but you're right about the memory size issue and your solution is definitely more efficient (in terms of economy and compatibility with Intel's compiler; not ugly at all :-) ). I don't see any other gotcha in the rest of the routine.

I'm somewhat puzzled by this behavior, especially considering that zero-sized arrays are allowed thanks to flexible array members within the ISO C99 standard, even on Intel's compilers. Anyway, there are several other places in Abinit with similar structures that could be problematic. I don't know if anyone has already tried to modify them; it's a bit awkward that this cannot be detected early. I'll have a look at the compiler's manual to see if there is an option that affects this behavior.

Amicalement,

Alain

Re: paral_kgb options

Posted: Tue Feb 22, 2011 3:07 pm
by bvcpbk
Btw, there is a somewhat similar problem in 79_seqpar_mpi/vtorho.F90.
Array buffer2 is allocated using the variable mb2dkpsp. The latter is initialized or not, depending on context. When it is used uninitialized in the allocation of buffer2, arbitrary effects are seen (SIGSEGV, unbalanced MPI barriers and the like). With Intel compilers, compiling with -ftrapuv points to the corresponding line.

The following patch (abinit 6.4.3) fixes this:

Code:

Index: src/79_seqpar_mpi/vtorho.F90
===================================================================
RCS file: /gfs2/work/bzfbbk/CVS/ABINIT/abinit-6.4.3/src/79_seqpar_mpi/vtorho.F90,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 vtorho.F90
--- src/79_seqpar_mpi/vtorho.F90   2 Feb 2011 14:34:55 -0000   1.1.1.1
+++ src/79_seqpar_mpi/vtorho.F90   22 Feb 2011 13:59:22 -0000
@@ -1216,7 +1216,8 @@
 
 !    If needed, exchange the values of eigen,resid,eknk,enlnk,grnlnk
      allocate(buffer1((4+3*natom*optforces-psps%usepaw)*mbdkpsp))
-     allocate(buffer2(mb2dkpsp*paw_dmft%use_dmft))
+     if(paw_dmft%use_dmft==1) &
+&       allocate(buffer2(mb2dkpsp*paw_dmft%use_dmft))
 !    Pack eigen,resid,eknk,enlnk,grnlnk in buffer1
      buffer1(1          :  mbdkpsp)=eigen(:)
      buffer1(1+  mbdkpsp:2*mbdkpsp)=resid(:)
@@ -1287,6 +1288,7 @@
        grnlnk(:,:)=reshape(buffer1(index1+1:index1+3*natom*mbdkpsp),&
 &       (/ 3*natom , mbdkpsp /) )
      end if
+     if(allocated(buffer2)) deallocate(buffer2)
      deallocate(buffer1)
      call timab(29,2,tsec)


However, I did not check whether the problem still exists in newer abinit releases.

Cheers BK

Re: paral_kgb options

Posted: Sat Feb 26, 2011 9:34 am
by mverstra
Thanks BK! This has been incorporated into 6.6 (soon to be released patch)

Matthieu