MPI-IO segfault with band/k-point/FFT parallelization
Posted: Mon May 31, 2010 6:02 am
Hello Everyone,
I am running a constrained LDA calculation (manually specifying occupation numbers) for a 128-atom supercell. Since convergence with k-point parallelization alone is quite slow, I would like to take advantage of band/k-point/FFT parallelization, so I used the following variables in my input file:
#parallelization variables
paral_kgb 1
wfoptalg=4
nloalg=4
fftalg=401
intxc=0
fft_opt_lob=2
npkpt 4
npband 9
npfft 2
bandpp 2
#accesswff 1 1 1 1
istwfk 1 1 1 1
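For reference, a quick sanity check on these settings, assuming the usual paral_kgb rules (total MPI processes = npkpt x npband x npfft, and nband a multiple of npband x bandpp for the LOBPCG solver):

npkpt x npband x npfft = 4 x 9 x 2 = 72 MPI processes
npband x bandpp = 9 x 2 = 18, and nband = 450 = 18 x 25 blocks

so the process grid itself seems consistent.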
I found that, since I compiled the code with the enable_mpi option, accesswff is automatically set to 1 even if I comment it out (configure options are listed below). The code seems to be doing well as far as electronic convergence is concerned (toldfe is set to 1e-6), as shown below:
......
ETOT 38 -1560.2634967021 -1.744E-02 2.442E-05 5.902E+02 8.009E-03 2.116E-02
ETOT 39 -1560.2728614790 -9.365E-03 1.369E-05 8.446E+00 1.519E-02 2.355E-02
ETOT 40 -1560.2710868467 1.775E-03 2.177E-05 1.455E+02 8.362E-03 2.002E-02
ETOT 41 -1560.2735058946 -2.419E-03 1.241E-05 5.197E+00 5.002E-03 2.201E-02
ETOT 42 -1560.2735661510 -6.026E-05 1.888E-05 2.061E+00 2.548E-03 2.093E-02
ETOT 43 -1560.2732355317 3.306E-04 8.588E-06 1.796E+01 3.270E-03 2.138E-02
ETOT 44 -1560.2736186433 -3.831E-04 9.703E-06 8.753E-01 2.021E-03 2.112E-02
ETOT 45 -1560.2736311134 -1.247E-05 8.009E-06 1.148E-01 3.549E-04 2.147E-02
ETOT 46 -1560.2736317200 -6.065E-07 1.044E-05 2.272E-02 2.073E-04 2.141E-02
ETOT 47 -1560.2736323600 -6.400E-07 8.383E-06 9.248E-03 1.722E-04 2.149E-02
but it crashes soon after that, when it attempts to write the WFK file:
*********************************
At SCF step 47, etot is converged :
for the second time, diff in etot= 6.400E-07 < toldfe= 1.000E-06
forstrnps : usepaw= 0
forstrnps: loop on k-points and spins done in parallel
-P-0000 leave_test : synchronization done...
strhar : before mpi_comm, harstr= 1.072006351610150E-002
9.865018747734584E-003 9.871758113503866E-003 1.198027124816839E-007
1.494415344832276E-007 7.404431448065469E-004
strhar : after mpi_comm, harstr= 1.887503908462354E-002
1.887500817960579E-002 1.658736148652544E-002 1.388586001033960E-007
1.243844495641594E-007 -6.792742386575647E-008
strhar : ehart,ucvol= 232.336116098840 26865.6696106559
Cartesian components of stress tensor (hartree/bohr^3)
sigma(1 1)= -3.13906187E-04 sigma(3 2)= -1.47617615E-08
sigma(2 2)= -3.13907380E-04 sigma(3 1)= -3.67408460E-09
sigma(3 3)= -3.49901037E-04 sigma(2 1)= 5.10828585E-09
ioarr: writing density data
ioarr: file name is 4exciteo_DEN
m_wffile.F90:272:COMMENT
MPI/IO accessing FORTRAN file header: detected record mark length= 4
ioarr: data written to disk file 4exciteo_DEN
-P-0000 leave_test : synchronization done...
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file 4exciteo_WFK
-P-0000 leave_test : synchronization done...
m_wffile.F90:272:COMMENT
MPI/IO accessing FORTRAN file header: detected record mark length= 4
***********************************************************************************************
A portion of the log file from the beginning of the run is listed below.
************** portion of output from log file *************************************
==== FFT mesh ====
FFT mesh divisions ........................ 216 216 240
Augmented FFT divisions ................... 217 217 240
FFT algorithm ............................. 401
FFT cache size ............................ 16
FFT parallelization level ................. 1
Number of processors in my FFT group ...... 2
Index of me in my FFT group ............... 0
No of xy planes in R space treated by me .. 108
No of xy planes in G space treated by me .. 120
MPI communicator for FFT .................. 0
Value of ngfft(15:18) ..................... 0 0 0 0
getmpw: optimal value of mpw= 33137
getdim_nloc : enter
pspheads(1)%nproj(0:3)= 0 1 1 1
getdim_nloc : deduce lmnmax = 15, lnmax = 3,
lmnmaxso= 15, lnmaxso= 3.
memory : analysis of memory needs
================================================================================
Values of the parameters that define the memory need of the present run
intxc = 0 ionmov = 0 iscf = 7 xclevel = 1
lmnmax = 3 lnmax = 3 mband = 450 mffmem = 1
P mgfft = 240 mkmem = 1 mpssoang= 4 mpw = 33137
mqgrid = 3001 natom = 128 nfft = 5598720 nkpt = 4
nloalg = 4 nspden = 1 nspinor = 1 nsppol = 1
nsym = 1 n1xccc = 2501 ntypat = 3 occopt = 0
================================================================================
P This job should need less than 1841.517 Mbytes of memory.
Rough estimation (10% accuracy) of disk space for files :
WF disk file : 1820.272 Mbytes ; DEN or POT disk file : 42.717 Mbytes.
================================================================================
******************************************************
I have access to three different clusters with different file systems (RAID1, BlueArc-FC, and Lustre) and I see the same problem on all of them. Another observation is that if the estimated WF file is smaller than 1 GB, MPI I/O works fine (with similar parallelization variables) and the code writes the WFK file properly.
I even tried setting accesswff to 0 (but still using paral_kgb 1); the calculation then completes successfully, but the WFK file is apparently incomplete, since cut3d gives an error when reading it.
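In case it helps to narrow things down, here is a minimal standalone MPI-IO test I intend to run (my own sketch, not ABINIT code; the file name bigio_test.dat is arbitrary). Each rank writes 1 GiB at its own offset, so with 4 ranks the file ends up around 4 GiB; if this fails on a given file system / MPI stack, the problem would sit below ABINIT.

/* Minimal MPI-IO large-file test: each rank writes 1 GiB at its own offset. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    const MPI_Offset block = (MPI_Offset)1 << 30;   /* 1 GiB per rank           */
    const int chunk = 16 * 1024 * 1024;             /* written in 16 MiB pieces */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(chunk);
    if (buf == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
    for (int i = 0; i < chunk; i++) buf[i] = (char)rank;

    MPI_File_open(MPI_COMM_WORLD, "bigio_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* rank r owns the byte range [r*block, (r+1)*block) */
    for (MPI_Offset off = 0; off < block; off += chunk) {
        MPI_Status st;
        MPI_File_write_at(fh, (MPI_Offset)rank * block + off, buf, chunk, MPI_CHAR, &st);
    }

    MPI_File_close(&fh);
    free(buf);
    if (rank == 0) printf("large MPI-IO write finished\n");
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with, e.g., mpirun -np 4 ./bigio_test, it should leave a ~4 GiB file behind if the MPI-IO layer handles offsets past 2 GiB correctly.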
Can someone please suggest ways to solve this issue? I can post the input file if that will help (I am using TM pseudopotentials).
Thank you.
Anurag
I am running ABINIT 6.0.4, compiled on a Linux cluster with the Intel 10.0 Fortran compiler. Build information is listed below. ABINIT was configured as follows:
$ ./configure --prefix=/global/home/users/achaudhry --enable-mpi-fft=yes --enable-mpi-io=yes --enable-fttw=yes --enable-64bit-flags=yes FC=mpif90 F77=mpif90 --enable-scalapack CC=mpicc CXX=mpiCC --with-mpi-runner=/global/software/centos-5.x86_64/modules/openmpi/1.4.1-intel/bin/mpirun --with-mpi-level=2 CC_LIBS=-lmpi CXX_LIBS=-lmpi++ -lmpi --with-fc-vendor=intel --with-linalg-includes=-I/global/software/centos-5.x86_64/modules/mkl/10.0.4.023/include --with-linalg-libs=-L/global/software/centos-5.x86_64/modules/mkl/10.0.4.023/lib/em64t -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_intel_solver_lp64 -lmkl_lapack -lmkl_core -lguide -lpthread --with-scalapack-libs=-L/global/software/centos-5.x86_64/modules/mkl/10.0.4.023/lib/em64t -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -liomp5 -lpthread --disable-wannier90
=== Build Information ===
Version : 6.0.4
Build target : x86_64_linux_intel0.0
Build date : 20100520
=== Compiler Suite ===
C compiler : intel10.1
CFLAGS : -I/global/software/centos-5.x86_64/modules/openmpi/1.3.3-intel/include
C++ compiler : intel10.1
CXXFLAGS : -g -O2 -vec-report0
Fortran compiler : intel0.0
FCFLAGS : -g -extend-source -vec-report0
FC_LDFLAGS : -static-libgcc -static-intel
=== Optimizations ===
Debug level : yes
Optimization level : standard
Architecture : intel_xeon
=== MPI ===
Parallel build : yes
Parallel I/O : yes
=== Linear algebra ===
Library type : external
Use ScaLAPACK : yes
=== Plug-ins ===
BigDFT : yes
ETSF I/O : yes
LibXC : yes
FoX : no
NetCDF : yes
Wannier90 : no
=== Experimental features ===
Bindings : no
Error handlers : no
Exports : no
GW double-precision : no
Macroave build : yes
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CPP options activated during the build:
CC_INTEL CXX_INTEL FC_INTEL
HAVE_FC_EXIT HAVE_FC_FLUSH HAVE_FC_GET_ENVIRONMEN...
HAVE_FC_LONG_LINES HAVE_FC_NULL HAVE_MPI
HAVE_MPI2 HAVE_MPI_IO HAVE_SCALAPACK
HAVE_STDIO_H USE_MACROAVE