
FFT Parallelization not working with 6.6.3 + ifort

Posted: Fri Jun 24, 2011 5:49 pm
by wolf.aarons
Hi everyone-

I am having a lot of trouble compiling abinit 6.6.3 with FFT parallelization and need help! I am using ifort 11.1.075 (the updated version) and cannot get FFT parallelization to work. I have large jobs (supercell calculations) that I need to run with limited memory, so I think I need to split the FFT work across more processors. If there are other options besides FFT parallelization, please let me know!

My compiled binary has no problems with k-point parallelization, but setting npband > 1 seems to break it. I have tried many things, including updating the Fortran compiler, linking against MKL, and adding the Fortran heap-arrays flag, based on the suggestions in these posts (viewtopic.php?f=2&t=655, viewtopic.php?f=2&t=1000, viewtopic.php?f=3&t=1139&p=3665&hilit=fftw3#p3665). I have also tried running over ethernet rather than Myrinet, the standard interconnect on our cluster. Nothing has worked and I am feeling quite desperate.
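For reference, the band/FFT parallelization block in my input file looks roughly like this (the values below are only illustrative; the actual counts differ between my runs):

# band/FFT (LOBPCG) parallelization settings - illustrative values only
paral_kgb  1    # enable the k-point/band/FFT parallelization scheme
npband     6    # processors in the band group (this is what breaks things for me)
npfft      2    # processors in the FFT group
bandpp     1    # bands treated together by each band processor
wfoptalg   4    # LOBPCG wavefunction optimization (used with paral_kgb)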

Here is the most recent configuration command I have used:
../configure --enable-64bit-flags --enable-mpi --enable-mpi-io --with-linalg-flavor=mkl --with-linalg-incs="-I/share/apps/intel/Compiler/11.1/075/mkl/include" --with-linalg-libs="-L/share/apps/intel/Compiler/11.1/075/mkl/lib/em64t -lmkl_intel_lp64 -lmkl_blacs_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lmkl_blas95_lp64 -lmkl_lapack" --with-mpi-prefix=/share/apps/openmpi-intel11.1.075/ FCFLAGS="-heap-arrays 64"

When running with npband > 1, the code always stops during the first or second SCF iteration, though the exact error differs depending on the configure setup. The configuration above gives this error:
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 45 6
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
abinip-6.6.3-i11. 0000000000DDC908 Unknown Unknown Unknown
abinip-6.6.3-i11. 000000000067CD6A Unknown Unknown Unknown
abinip-6.6.3-i11. 00000000006149C6 Unknown Unknown Unknown
abinip-6.6.3-i11. 00000000005FAA33 Unknown Unknown Unknown

If anyone can help me I would greatly appreciate it! Thanks in advance.

Best,
Aaron Wolf

Re: FFT Parallelization not working with 6.6.3 + ifort

Posted: Mon Jun 27, 2011 11:31 am
by pouillon
It might just be that you are trying to allocate more memory than is available on your system. And Abinit tends to underestimate the total amount of memory needed.

Try decreasing the values of the input parameters (to reduce the size of the problem) and check whether it keeps crashing.

Another way could be to change the number of nodes and the number of CPUs used per node, in order to give more memory to each processor; see the sketch below.
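For instance, with OpenMPI you can keep the same total number of MPI processes but spread them over more nodes. The exact syntax depends on your scheduler; a sketch could look like this:

# 24 processes over 6 nodes (4 per node) instead of 3 nodes (8 per node),
# so each process gets roughly twice the memory
mpirun -np 24 -npernode 4 abinit < my_calculation.files > log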

Re: FFT Parallelization not working with 6.6.3 + ifort

Posted: Tue Jun 28, 2011 2:33 am
by wolf.aarons
Based on Yann's suggestions, I performed the same tests but with twice the memory allocated. To do this, I halved the number of CPUs used per node, doubling the memory available to each processor. These runs were given more than twice the estimated memory requirement reported by abinit (how inaccurate are those estimates?). Unfortunately, they still failed, though the errors differed depending on the values of npband and bandpp. I ran the same input file (see attached) with only those two values changed:

for the larger run: npband 4 bandpp 1
for the smaller run: npband 2 bandpp 2

Below I have pasted the end of the failed log files from these two runs. I really appreciate any help anyone can give me!

Best,
Aaron Wolf

========== Larger Run error =============
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 4
*** glibc detected *** /home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2: double free or corruption (!prev): 0x00002aac8a771ea0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d11a722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3d11a7273b]
/share/apps//openmpi-intel11.1.075/lib/openmpi/mca_coll_tuned.so[0x2b4327820ae6]
/share/apps//openmpi-intel11.1.075/lib/openmpi/mca_coll_tuned.so[0x2b432781c709]
/share/apps/openmpi-intel11.1.075/lib/libmpi.so.0(MPI_Allreduce+0x76)[0x2b43220f0796]
/share/apps/openmpi-intel11.1.075/lib/libmpi_f77.so.0(MPI_ALLREDUCE+0xc5)[0x2b4321e91725]
/home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2(m_xmpi_mp_xsum_mpi_dp4d_+0x344)[0x15d1754]
and more junk like this...

============= Smaller Run Error ============
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 2
-P-0000 leave_test : synchronization done...
vtorho: loop on k-points and spins done in parallel
-P-0000 leave_test : synchronization done...

*********** RHOIJ (atom 1) **********
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
... only 12 components have been written...

*********** RHOIJ (atom 160) **********
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

Total charge density [el/Bohr^3]
, Maximum= NaN at reduced coord. 0.0000 0.0000 0.0000
, Minimum= NaN at reduced coord. 0.0000 0.0000 0.0000
ETOT 1 NaN NaN 6.138E-02 NaN NaN NaN
scprqt: <Vxc>= -5.2556931E-01 hartree

Simple mixing update:
residual square of the potential : NaN

****** TOTAL Dij in Ha (atom 1) *****
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
... only 12 components have been written...

ITER STEP NUMBER 2
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 2
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 2

m_eigen.F90:294:WARNING
Problem in xev, info= 3
WARNING in zpotrf, info= 1

m_eigen.F90:383:WARNING
Problem in xgv, info= 7
WARNING in zpotrf, info= 1
WARNING in zpotrf, info= 1

m_eigen.F90:294:WARNING
Problem in xev, info= 11
condition number of the Gram matrix= NaN
Lobpcgwf: restart performed

m_eigen.F90:383:WARNING
Problem in xgv, info= 7
WARNING in zpotrf, info= 1
WARNING in zpotrf, info= 1

Then this repeats forever...

Re: FFT Parallelization not working with 6.6.3 + ifort

Posted: Wed Jun 29, 2011 5:24 pm
by pouillon
*** glibc detected *** /home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2: double free or corruption (!prev): 0x00002aac8a771ea0 ***

This likely means that there is a memory leak or that the amount of memory you need is really huge.

Seeing the NaNs in the output of Dij, I would say that there might be a problem related to PAW, but I can't be more precise with the data you provided.

Please also keep in mind that some versions of MKL contain some peculiar bugs. The backtrace might give you useful information there, for instance if it leads you to an MKL call.
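If the stack frames in the traceback all show up as "Unknown", rebuilding with ifort's traceback support usually makes them readable. A minimal sketch, keeping your other configure options unchanged:

# add debug symbols and a symbolic runtime traceback to the Fortran flags
../configure [your other options] FCFLAGS="-g -traceback -heap-arrays 64"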

One way to determine whether it comes from MKL would be to recompile Abinit with FFTW3 and the internal linear algebra libraries. If your calculation then proceeds, it will mean that MKL is the culprit.
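A sketch of such a configure line (the option names follow the 6.x build system, so please check ./configure --help for the exact spelling on your installation, and adjust the FFTW3 path to yours):

# drop MKL: use FFTW3 for the FFTs and the netlib fallback for linear algebra
../configure --enable-64bit-flags --enable-mpi --enable-mpi-io \
  --with-fft-flavor=fftw3 \
  --with-fft-libs="-L/path/to/fftw3/lib -lfftw3" \
  --with-linalg-flavor=netlib \
  --with-mpi-prefix=/share/apps/openmpi-intel11.1.075/ \
  FCFLAGS="-heap-arrays 64"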