If I compile MPI-IO support into Abinit, the jobs hang at the stage of writing the WFK files. The last log output (for molecules of any size) is:
----iterations are completed or convergence reached----
outwf : write wavefunction to file poro_WFK
-P-0000 leave_test : synchronization done...
m_wffile.F90:279:COMMENT
MPI/IO accessing FORTRAN file header: detected record mark length=4
The code does not proceed any further, although processor utilization stays near 100% (typically at least 95% across all processors/nodes). The WFK file as written is about 800-850 KB; I suppose it contains just the header.
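For context, the "detected record mark length=4" message refers to the 4-byte length markers that bracket each record in a Fortran sequential unformatted file; Abinit's MPI-IO layer probes the file header to detect this before taking over with raw byte offsets. A minimal sketch of that record layout (the filename and payload are made up, and I'm assuming the little-endian 4-byte markers that gfortran writes on x86):

```python
import struct

def write_record(f, payload: bytes):
    """Write one Fortran sequential unformatted record:
    4-byte length marker, payload, 4-byte length marker."""
    marker = struct.pack("<i", len(payload))
    f.write(marker + payload + marker)

# Write a one-record demo file mimicking a Fortran-written header.
with open("fake_header.bin", "wb") as f:
    write_record(f, b"ABINIT-header-demo")

# Read it back the way an MPI-IO reader must: the leading marker
# gives the payload size of the first record.
with open("fake_header.bin", "rb") as f:
    (length,) = struct.unpack("<i", f.read(4))
    payload = f.read(length)
    (trailer,) = struct.unpack("<i", f.read(4))

print(length, payload, trailer == length)  # → 18 b'ABINIT-header-demo' True
```

So the COMMENT line itself is benign detection output, not an error; the hang happens after it.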
Technical details: the code is compiled with gfortran 4.6.2, fftw3, and openblas, against either OpenMPI 1.4.5 or mvapich2 (1.8 and HEAD). The error is the same regardless of MPI implementation. This is odd, because mvapich2 still has the type-error problem when building against MPI level 2, so MPI level 1 is needed to get Abinit compiled at all. Since, as far as I'm aware, the MPI-IO routines aren't actually in mpi.h at that level, you'd expect something to break catastrophically for mvapich2 with MPI-IO.
If I disable MPI-IO, the jobs complete, but the WFK files are incomplete. They are larger (several MB at least), but not large enough to contain all the cg (wavefunction) coefficients, especially for the larger molecules.
Note that the DEN files always write without trouble, and it doesn't appear to be a file-size problem (as hinted in previous threads), because the hang occurs regardless of molecule size; for example, the 38-atom porphine core in the example input below fails.
Another data point: I don't have this problem on my workstation with MPI-IO enabled and MPICH2, but my molecules are too big for the workstation to handle in a reasonable amount of time. Possibly MPI-IO is a bit broken on our cluster, but what steps should I take to try to isolate this further for the developers?
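One check I could run on the truncated non-MPI-IO output is to walk the Fortran record markers from the start of the WFK file and count how many records were written completely, to see exactly where the write stopped. A minimal sketch (the demo filename and payloads are made up; it assumes the same little-endian 4-byte markers as above):

```python
import os
import struct

def count_complete_records(path):
    """Walk Fortran sequential-unformatted record markers and count
    records whose leading and trailing markers both match."""
    n = 0
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        pos = 0
        while pos + 8 <= size:
            (length,) = struct.unpack("<i", f.read(4))
            if pos + 8 + length > size:
                break  # record truncated mid-payload
            f.seek(length, 1)  # skip the payload
            (trailer,) = struct.unpack("<i", f.read(4))
            if trailer != length:
                break  # corrupt or unfinished record
            n += 1
            pos += 8 + length
    return n

# Demo: two complete records followed by a truncated third one,
# mimicking a WFK write that died partway through the cg data.
with open("demo_WFK", "wb") as f:
    for payload in (b"header", b"occupations"):
        m = struct.pack("<i", len(payload))
        f.write(m + payload + m)
    f.write(struct.pack("<i", 1000) + b"partial cg data")  # cut off

print(count_complete_records("demo_WFK"))  # → 2
```

Running something like this on a real truncated WFK file would show whether the write dies at a consistent record boundary across runs.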
I've pasted in an input file, build .ac file, and jobscript for our cluster that will reproduce the problem. I still can't add attachments (even with .txt extensions...).
Thanks,
Kane
Input file:
# porphin molecule
#iscf -3
#occopt 1
nband 320
#nbdbuf 160
# DOS stuff
#prtdos 3
#natsph 20
#iatsph 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#pawprtdos 2
# SCF parameters
ecut 14.0
pawecutdg 35.0
#toldff 5.0d-6
#toldfe 4.0d-8 # Tighten up convergence slightly
tolvrs 1.0d-10
nstep 200
#istwfk 2 # Gamma-point only means real wfs.
#occopt 7 # Gaussian smearing
#tsmear 0.02
diemix 0.33
diemac 2.0
#timopt 2
# Geometry Optimization
#optcell 0
#ionmov 2
#tolmxf 5.0d-5
#ntime 100
# Kpoints
# Note: for DOS, need more than gamma point.
kptopt 1
ngkpt 1 1 1
#ngkpt 2 2 2 # For DOS
nshiftk 1
shiftk
0.0 0.0 0.0 # Need to use the true gamma point.
#nband 80 # Done to fix an input bug.
# Parallelization
paral_kgb 1
npkpt 1
npspinor 1
npband 8
npfft 2
bandpp 4
# Geometry
acell 24.0 24.0 14.0 angstrom
natom 38
ntypat 3
znucl 6 7 1
typat 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3
3 3 3 3 3 3
xangst
14.43267826697125 9.54719469654066 6.98679555192892
9.56971890234498 9.55740143113795 6.98136687410452
9.57830669996704 14.45323690496433 6.99729471155192
14.44063286032851 14.44280830865907 7.00065218638275
13.09053535559251 9.13328020797658 6.98306316607677
12.67923156052460 7.73168327773270 6.97936199186946
11.31569250083367 7.73433869282024 6.97757649970769
10.90999275359248 9.13761393307977 6.98028668551213
9.09554011542721 10.87314035126708 6.98561843864410
7.72903461616869 11.31936915287400 6.98735375088374
7.73139760192924 12.69760130406586 6.99113573510774
9.09938031597771 13.13925447483067 6.99243851467787
10.92010300631924 14.86758534532777 7.00085421660784
11.33151776548370 16.26917592327966 7.00566526340061
12.69511949460426 16.26616429625911 7.00672787336548
13.10041479027480 14.86264378806152 7.00242954254950
14.91526145037882 13.12729369433697 6.99679680639886
16.28188688560975 12.68157456405687 6.99612434170756
16.28001488882063 11.30329858626045 6.99265482691192
14.91210592507385 10.86110781146765 6.99071674837176
14.13078870987637 11.99535250594183 6.99326588288911
9.88047357875540 12.00482853795923 6.98882891763749
12.00192233673013 9.96929220396100 6.98329443280378
12.00837527590531 14.03130874233860 6.99948340730862
15.20432290532285 15.22540747831387 7.00200372856651
15.19313884171311 8.76129846997055 6.98744336798576
8.80561700326767 8.77507783974137 6.97949471039301
8.81771718378007 15.23885330665432 6.99845713352080
13.37257765050554 17.11992092452597 7.01033475042726
10.65751847151932 17.12561065649051 7.00718932911153
17.14387397433472 13.34609132569375 6.99830189384125
17.14042276570114 10.63682513020204 6.99118969243677
13.35278112781582 6.87498761980712 6.97878434452385
10.63876867003055 6.88026233515408 6.97553810680641
6.87131218204644 13.36438702790796 6.99316119368211
6.86695286671293 10.65501172582219 6.98633580459881
10.90483155436783 12.00301405431957 6.99005188546697
13.10645914541739 11.99653337020754 6.99332769226049
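For reference, a quick arithmetic check that the parallelization block above is consistent with the 16 MPI ranks the jobscript requests (nodes=2:ppn=8). The divisibility rule for nband is my understanding of the paral_kgb constraints, not something I've verified in the source:

```python
# Values copied from the input file above; the PBS request asks for
# nodes=2:ppn=8, i.e. 16 MPI ranks.
nband, npkpt, npspinor, npband, npfft, bandpp = 320, 1, 1, 8, 2, 4

# Total ranks consumed by the kpt/spinor/band/fft distribution.
ranks = npkpt * npspinor * npband * npfft

# Assumed constraint: nband must be divisible by npband * bandpp.
block = npband * bandpp

print(ranks, nband % block)  # → 16 0
```

So the hang doesn't look like a rank-count mismatch; the distribution fills all 16 processes.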
Build ac file:
enable_64bit_flags="yes"
#enable_optim="yes"
enable_debug="no"
prefix="${HOME}/local"
CPP="$HOME/local/bin/cpp-4.6"
CC="$HOME/local/bin/mpicc"
#CFLAGS_OPTIM="-O3 -march=native"
#C_LDFLAGS="-static-libgcc"
CXX="$HOME/local/bin/mpicxx"
#CXXFLAGS_OPTIM="-O3 -march=native"
FC="$HOME/local/bin/mpif90"
F77="$HOME/local/bin/mpif77"
#FCFLAGS_DEBUG="-g -fopenmp"
#FCFLAGS_OPTIM="-O3 -march=native -mtune=native -funroll-loops -floop-block -flto"
#FC_LDFLAGS="-flto -static-libgfortran -static-libgcc"
FC_LIBS_EXTRA="-lgomp"
#enable_stdin="no"
#fcflags_opt_59_io_mpi="-O2"
#fcflags_opt_51_manage_mpi="-O2"
enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_level="1"
MPI_RUNNER="$HOME/local/bin/mpirun"
with_trio_flavor="netcdf+etsf_io"
#with_etsf_io_incs="-I/opt/etsf/include"
#with_etsf_io_libs="-L/opt/etsf/lib -letsf_io"
#with_netcdf_incs="-I/usr/local/include/netcdf"
#with_netcdf_libs="-L/usr/local/lib/netcdf -lnetcdff -lnetcdf"
with_fft_flavor="fftw3"
with_fft_incs="-I$HOME/local/include"
with_fft_libs="-L$HOME/local/lib -lfftw3"
with_linalg_flavor="custom"
with_linalg_incs="-I$HOME/local/include"
with_linalg_libs="-L$HOME/local/lib -lopenblas"
with_dft_flavor="atompaw+libxc+wannier90"
with_libxc_incs="-I$HOME/local/include"
with_libxc_libs="-L$HOME/local/lib -lxc"
#enable_fallbacks="no"
#enable_gw_cutoff="yes"
enable_gw_dpc="yes"
enable_gw_openmp="yes"
#enable_gw_optimal="yes"
#enable_gw_wrapper="yes"
enable_smp="yes"
enable_fast_check="yes"
Jobscript:
#!/bin/bash
#PBS -S /bin/bash
#PBS -N porphin-ORB1
#PBS -l nodes=2:ppn=8,walltime=04:00:00,pmem=4000MB
#PBS -joe
#PBS -V
EXTRA_FILES=
ABINIT=$HOME/test/bin/abinit
MPIEXEC="$HOME/local/bin/mpiexec-osc"
#export LD_LIBRARY_PATH=$HOME/test/lib64:$HOME/test/lib:$LD_LIBRARY_PATH
cd $PBS_O_WORKDIR
# Temporary directory
export TMPDIR="$HOME/ASync002_scratch"
# Work directory
export WORKDIR="$HOME/ASync002_scratch/${PBS_JOBID}-abinit-${PBS_JOBNAME}"
mkdir $WORKDIR
# Copy ab.files and the input file into the work directory and go there.
cp ab.files *.in $EXTRA_FILES $WORKDIR
cd $WORKDIR
# Modify ab.files so that the tmp files are always local - PBS
# or mpiexec creates a directory /tmp/$PBS_JOBID on each node.
sed -i "s_>REPLACE<_/tmp/${PBS_JOBID}/tmp_" ab.files
# Make sure we explicitly set OMP_NUM_THREADS or who knows
# what OpenMP nightmares will happen.
export OMP_NUM_THREADS=1
# Get our nodefile and write it to this directory for reference.
cp $PBS_NODEFILE .
# Now run (hopefully PBS gives us the right nodes!)
$MPIEXEC $ABINIT < ab.files >& log
# Move the job folder to the original submission dir.
cd ..
mv $WORKDIR $PBS_O_WORKDIR