Hi everyone-
I am having a lot of trouble compiling Abinit 6.6.3 for FFT parallelization and need help! I am using ifort 11.1.075 (the updated version) and I cannot get FFT parallelization working. I have large jobs (supercell calculations) that I need to run with limited memory, so I am thinking I need to split the FFT work over more processors. If I have any other options besides FFT parallelization, please let me know!
My compilation has no problems with k-point parallelization, but using npband seems to break it. I have tried many things, including updating the Fortran compiler, using MKL, and adding the Fortran heap-arrays flag, based on the suggestions in these posts (viewtopic.php?f=2&t=655, viewtopic.php?f=2&t=1000, viewtopic.php?f=3&t=1139&p=3665&hilit=fftw3#p3665). I have also tried using Ethernet rather than Myrinet, the standard on our cluster. Nothing has worked and I am feeling quite desperate.
Here is the most recent configuration command I have used:
../configure --enable-64bit-flags --enable-mpi --enable-mpi-io --with-linalg-flavor=mkl --with-linalg-incs="-I/share/apps/intel/Compiler/11.1/075/mkl/include" --with-linalg-libs="-L/share/apps/intel/Compiler/11.1/075/mkl/lib/em64t -lmkl_intel_lp64 -lmkl_blacs_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lmkl_blas95_lp64 -lmkl_lapack" --with-mpi-prefix=/share/apps/openmpi-intel11.1.075/ FCFLAGS="-heap-arrays 64"
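For reference, the band/FFT parallelization I am trying to use is switched on through the usual input variables; a minimal sketch of the block I add to the input file is below (the counts are only illustrative, and the npband and bandpp values I actually used for each run are given later in the thread):
paral_kgb 1    # enable band/FFT (KGB) parallelization
npband    4    # processors for band parallelization
npfft     2    # processors for FFT parallelization (illustrative value)
bandpp    1    # bands treated together by each band processor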
When running with npband > 1, the code always stops on the first or second iteration, though the error can differ depending on the config setup. The config above gives this error:
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 45 6
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
abinip-6.6.3-i11. 0000000000DDC908 Unknown Unknown Unknown
abinip-6.6.3-i11. 000000000067CD6A Unknown Unknown Unknown
abinip-6.6.3-i11. 00000000006149C6 Unknown Unknown Unknown
abinip-6.6.3-i11. 00000000005FAA33 Unknown Unknown Unknown
If anyone can help me I would greatly appreciate it! Thanks in advance.
Best,
Aaron Wolf
FFT Parallelization not working with 6.6.3 + ifort
Re: FFT Parallelization not working with 6.6.3 + ifort
It might just be that you are trying to allocate more memory than is available on your system, and Abinit tends to underestimate the total amount of memory needed.
Try decreasing the values of the input parameters and check whether it keeps crashing.
Another way could be to change the nodes / cpus-per-node ratio, in order to give more memory to each processor.
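For instance, with OpenMPI you can limit the number of MPI ranks placed on each node directly on the mpirun line (a sketch, assuming a files file named abinit.files; the same effect is usually obtained through the nodes/ppn request of your batch scheduler):
mpirun -np 8 -npernode 2 abinit < abinit.files > log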
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain
Re: FFT Parallelization not working with 6.6.3 + ifort
Based on Yann's suggestions, I performed the same tests but with twice the memory allocated. To do this, I halved the cpus-per-node ratio, doubling the memory available to each processor. These runs were allocated over twice the memory requirement estimated by Abinit (how inaccurate are those estimates?). Unfortunately, both runs still failed, and the errors differed depending on the values of npband and bandpp. I ran the same input file (see attached) with only those values changed:
for the larger run: npband 4 bandpp 1
for the smaller run: npband 2 bandpp 2
Below I have pasted the end of the failed log files from these two runs. I really appreciate any help anyone can give me!
Best,
Aaron Wolf
========== Larger Run error =============
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 4
*** glibc detected *** /home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2: double free or corruption (!prev): 0x00002aac8a771ea0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3d11a722ef]
/lib64/libc.so.6(cfree+0x4b)[0x3d11a7273b]
/share/apps//openmpi-intel11.1.075/lib/openmpi/mca_coll_tuned.so[0x2b4327820ae6]
/share/apps//openmpi-intel11.1.075/lib/openmpi/mca_coll_tuned.so[0x2b432781c709]
/share/apps/openmpi-intel11.1.075/lib/libmpi.so.0(MPI_Allreduce+0x76)[0x2b43220f0796]
/share/apps/openmpi-intel11.1.075/lib/libmpi_f77.so.0(MPI_ALLREDUCE+0xc5)[0x2b4321e91725]
/home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2(m_xmpi_mp_xsum_mpi_dp4d_+0x344)[0x15d1754]
and more junk like this...
============= Smaller Run Error ============
ITER STEP NUMBER 1
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 2
-P-0000 leave_test : synchronization done...
vtorho: loop on k-points and spins done in parallel
-P-0000 leave_test : synchronization done...
*********** RHOIJ (atom 1) **********
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 ...
... only 12 components have been written...
*********** RHOIJ (atom 160) **********
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
Total charge density [el/Bohr^3]
, Maximum= NaN at reduced coord. 0.0000 0.0000 0.0000
, Minimum= NaN at reduced coord. 0.0000 0.0000 0.0000
ETOT 1 NaN NaN 6.138E-02 NaN NaN NaN
scprqt: <Vxc>= -5.2556931E-01 hartree
Simple mixing update:
residual square of the potential : NaN
****** TOTAL Dij in Ha (atom 1) *****
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
... only 12 components have been written...
ITER STEP NUMBER 2
vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 2
starting lobpcg, with nblockbd,mpi_enreg%nproc_band 135 2
m_eigen.F90:294:WARNING
Problem in xev, info= 3
WARNING in zpotrf, info= 1
m_eigen.F90:383:WARNING
Problem in xgv, info= 7
WARNING in zpotrf, info= 1
WARNING in zpotrf, info= 1
m_eigen.F90:294:WARNING
Problem in xev, info= 11
condition number of the Gram matrix= NaN
Lobpcgwf: restart performed
m_eigen.F90:383:WARNING
Problem in xgv, info= 7
WARNING in zpotrf, info= 1
WARNING in zpotrf, info= 1
Then this repeats forever...
Attachments: pv_120gpa_222_bigTest.in (18.82 KiB)
Re: FFT Parallelization not working with 6.6.3 + ifort
*** glibc detected *** /home/awolf/bin/abinip-6.6.3-i11.1-mkl-R2: double free or corruption (!prev): 0x00002aac8a771ea0 ***
This likely means that there is a memory leak or that the amount of memory you need is really huge.
Seeing the NaNs in the output of Dij, I would say that there might be a problem related to PAW, but I can't be more precise with the data you provided.
Please remember as well that some versions of MKL contain peculiar bugs. You might find useful information in the backtrace, leading you to an MKL call, for instance.
One way to determine whether it comes from MKL would be to recompile Abinit with FFTW3 and the internal linear algebra libraries. If your calculation proceeds, it will mean that MKL is the culprit.
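As a sketch, assuming FFTW3 is installed under /path/to/fftw3, the configure line could look like this (dropping the --with-linalg-* options lets the build fall back to the internal linear algebra routines):
../configure --enable-64bit-flags --enable-mpi --enable-mpi-io --with-fft-flavor=fftw3 --with-fft-incs="-I/path/to/fftw3/include" --with-fft-libs="-L/path/to/fftw3/lib -lfftw3" --with-mpi-prefix=/share/apps/openmpi-intel11.1.075/ FCFLAGS="-heap-arrays 64"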
Yann Pouillon
Simune Atomistics
Donostia-San Sebastián, Spain