Abinit 6.12.2 hanging on WFK write with MPI-IO

Posted: Tue Mar 27, 2012 4:43 am
by kaneod
I'm in a bit of a bind. I need the WFK files for a series of rather large molecules (38-173 atoms) so I can visualize the HOMO/LUMO of each. Since I'm only running a gamma-point calculation, I'm using KGB parallelization over up to 640 bands on up to 48 processors (8 per node). Temporary files are written to each node's tmp area, but the collective output files go to a fast GPFS filesystem.

If I compile MPI-IO support into Abinit, the jobs hang while writing the WFK files. The last log output (for any molecule size) is:

Code: Select all

 ----iterations are completed or convergence reached----

 outwf  : write wavefunction to file poro_WFK
-P-0000  leave_test : synchronization done...

 m_wffile.F90:279:COMMENT
   MPI/IO accessing FORTRAN file header: detected record mark length=4


The code does not proceed any further, although processor utilization appears to stay near 100% (typically at least 95% across all processors/nodes). The WFK file as written is about 800-850 kB; I suppose it contains just the header.
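Incidentally, the 4-byte record mark the log reports can be checked directly on the truncated file with od (this assumes the file starts with a standard gfortran unformatted record marker; poro_WFK is the file name from my run):

Code: Select all

 od -A d -t d4 -N 4 poro_WFK

On a little-endian machine this prints the first 4 bytes as a decimal integer, i.e. the byte length of the first header record.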

Technical details: the code is compiled with gfortran 4.6.2, FFTW3 and OpenBLAS, against either OpenMPI 1.4.5 or MVAPICH2 (1.8 and HEAD). The error is the same regardless of MPI implementation. That is odd: MVAPICH2 still has the type-error problem when building at MPI level 2, so MPI level 1 is needed to get Abinit compiled at all. Since the MPI-IO routines aren't actually in mpi.h as far as I'm aware, you'd expect something to break catastrophically for MVAPICH2 with MPI-IO.

If I disable MPI-IO, the jobs complete, but the WFK files are incomplete: they are larger (at least several MB) but nowhere near large enough to hold all the cg coefficients, especially for the larger molecules. (A rough plane-wave count for this cell and ecut already puts the full cg block in the hundreds of MB for the porphine example below.)

Note that the DEN files never have any trouble writing. It also doesn't appear to be a file-size problem (as has been hinted in previous threads), because the hang occurs regardless of molecule size; for example, the 38-atom porphine core in the example input below fails.

Another thing: I don't have this problem on my workstation with MPI-IO enabled and MPICH2, but my molecules are too big for the workstation to handle in a reasonable amount of time. Possibly MPI-IO is a bit broken on our cluster. What steps should I take to try to isolate this further for the developers?
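One thing I plan to try as a first isolation step (a minimal sketch of my own, not code from the Abinit sources): a standalone Fortran program that does a collective MPI-IO write to a shared file, built with the same mpif90 and run with 16 ranks over 2 nodes from a GPFS directory. If this also hangs, the fault is in the MPI-IO/GPFS layer rather than in Abinit. It uses include 'mpif.h' to match the with_mpi_level="1" build below; the file name and the 4-byte default-integer assumption are mine:

Code: Select all

program mpiio_test
  implicit none
  include 'mpif.h'   ! mpif.h rather than the mpi module, as in the level-1 build
  integer :: ierr, rank, nproc, fh, buf
  integer :: istat(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: offset

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  ! Every rank writes its rank number at its own offset in one shared file.
  call MPI_File_open(MPI_COMM_WORLD, 'mpiio_test.dat', &
&      MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  buf    = rank
  offset = int(rank, kind=MPI_OFFSET_KIND) * 4   ! assumes 4-byte default integers

  ! Collective write-at: the same class of operation a parallel WFK write relies on.
  call MPI_File_write_at_all(fh, offset, buf, 1, MPI_INTEGER, istat, ierr)

  call MPI_File_close(fh, ierr)
  if (rank == 0) write(*,*) 'collective MPI-IO write finished on', nproc, 'ranks'
  call MPI_Finalize(ierr)
end program mpiio_test

Build and run with mpif90 mpiio_test.f90 -o mpiio_test, then mpirun -np 16 ./mpiio_test from a directory on the GPFS filesystem.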

I've pasted in an input file, build .ac file and jobscript for our cluster that will reproduce the problem; I still can't add attachments (even with .txt extensions...).

Thanks,

Kane

Input file:

Code: Select all

# porphin molecule


#iscf        -3
#occopt      1
nband       320
#nbdbuf      160

# DOS stuff

#prtdos      3
#natsph      20
#iatsph      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#pawprtdos   2

# SCF parameters

ecut        14.0
pawecutdg   35.0
#toldff      5.0d-6
#toldfe      4.0d-8         # Tighten up convergence slightly
tolvrs      1.0d-10
nstep       200
#istwfk      2               # Gamma-point only means real wfs.
#occopt      7               # Gaussian smearing
#tsmear      0.02
diemix      0.33
diemac      2.0
#timopt     2

# Geometry Optimization

#optcell     0
#ionmov     2
#tolmxf     5.0d-5
#ntime      100         

# Kpoints

# Note: for DOS, need more than gamma point.

kptopt      1
ngkpt       1   1   1
#ngkpt       2 2 2     # For DOS
nshiftk     1
shiftk      0.0 0.0 0.0     # Need to use the true gamma point.
#nband       80             # Done to fix an input bug.

# Parallelization

paral_kgb       1
npkpt            1
npspinor         1
npband           8
npfft            2
bandpp           4
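# Note: npkpt*npspinor*npband*npfft = 1*1*8*2 = 16 MPI processes,
# matching nodes=2:ppn=8 in the jobscript; nband (320) is a
# multiple of npband*bandpp (32).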

# Geometry

acell       24.0    24.0    14.0     angstrom
natom       38
ntypat      3
znucl       6 7 1
typat       1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
            1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3
            3 3 3 3 3 3
xangst
14.43267826697125     9.54719469654066     6.98679555192892
9.56971890234498     9.55740143113795     6.98136687410452
9.57830669996704    14.45323690496433     6.99729471155192
14.44063286032851    14.44280830865907     7.00065218638275
13.09053535559251     9.13328020797658     6.98306316607677
12.67923156052460     7.73168327773270     6.97936199186946
11.31569250083367     7.73433869282024     6.97757649970769
10.90999275359248     9.13761393307977     6.98028668551213
9.09554011542721    10.87314035126708     6.98561843864410
7.72903461616869    11.31936915287400     6.98735375088374
7.73139760192924    12.69760130406586     6.99113573510774
9.09938031597771    13.13925447483067     6.99243851467787
10.92010300631924    14.86758534532777     7.00085421660784
11.33151776548370    16.26917592327966     7.00566526340061
12.69511949460426    16.26616429625911     7.00672787336548
13.10041479027480    14.86264378806152     7.00242954254950
14.91526145037882    13.12729369433697     6.99679680639886
16.28188688560975    12.68157456405687     6.99612434170756
16.28001488882063    11.30329858626045     6.99265482691192
14.91210592507385    10.86110781146765     6.99071674837176
14.13078870987637    11.99535250594183     6.99326588288911
9.88047357875540    12.00482853795923     6.98882891763749
12.00192233673013     9.96929220396100     6.98329443280378
12.00837527590531    14.03130874233860     6.99948340730862
15.20432290532285    15.22540747831387     7.00200372856651
15.19313884171311     8.76129846997055     6.98744336798576
8.80561700326767     8.77507783974137     6.97949471039301
8.81771718378007    15.23885330665432     6.99845713352080
13.37257765050554    17.11992092452597     7.01033475042726
10.65751847151932    17.12561065649051     7.00718932911153
17.14387397433472    13.34609132569375     6.99830189384125
17.14042276570114    10.63682513020204     6.99118969243677
13.35278112781582     6.87498761980712     6.97878434452385
10.63876867003055     6.88026233515408     6.97553810680641
6.87131218204644    13.36438702790796     6.99316119368211
6.86695286671293    10.65501172582219     6.98633580459881
10.90483155436783    12.00301405431957     6.99005188546697
13.10645914541739    11.99653337020754     6.99332769226049


Build ac file:

Code: Select all

enable_64bit_flags="yes"
#enable_optim="yes"
enable_debug="no"
prefix="${HOME}/local"

CPP="$HOME/local/bin/cpp-4.6"
CC="$HOME/local/bin/mpicc"
#CFLAGS_OPTIM="-O3 -march=native"
#C_LDFLAGS="-static-libgcc"
CXX="$HOME/local/bin/mpicxx"
#CXXFLAGS_OPTIM="-O3 -march=native"
FC="$HOME/local/bin/mpif90"
F77="$HOME/local/bin/mpif77"
#FCFLAGS_DEBUG="-g -fopenmp"
#FCFLAGS_OPTIM="-O3 -march=native -mtune=native -funroll-loops -floop-block -flto"
#FC_LDFLAGS="-flto -static-libgfortran -static-libgcc"
FC_LIBS_EXTRA="-lgomp"

#enable_stdin="no"
#fcflags_opt_59_io_mpi="-O2"
#fcflags_opt_51_manage_mpi="-O2"

enable_mpi="yes"
enable_mpi_io="yes"
with_mpi_level="1"
MPI_RUNNER="$HOME/local/bin/mpirun"

with_trio_flavor="netcdf+etsf_io"
#with_etsf_io_incs="-I/opt/etsf/include"
#with_etsf_io_libs="-L/opt/etsf/lib -letsf_io"
#with_netcdf_incs="-I/usr/local/include/netcdf"
#with_netcdf_libs="-L/usr/local/lib/netcdf -lnetcdff -lnetcdf"

with_fft_flavor="fftw3"
with_fft_incs="-I$HOME/local/include"
with_fft_libs="-L$HOME/local/lib -lfftw3"

with_linalg_flavor="custom"
with_linalg_incs="-I$HOME/local/include"
with_linalg_libs="-L$HOME/local/lib -lopenblas"

with_dft_flavor="atompaw+libxc+wannier90"

with_libxc_incs="-I$HOME/local/include"
with_libxc_libs="-L$HOME/local/lib -lxc"

#enable_fallbacks="no"

#enable_gw_cutoff="yes"
enable_gw_dpc="yes"
enable_gw_openmp="yes"
#enable_gw_optimal="yes"
#enable_gw_wrapper="yes"
enable_smp="yes"

enable_fast_check="yes"


Jobscript:

Code: Select all

#!/bin/bash

#PBS -S /bin/bash
#PBS -N porphin-ORB1
#PBS -l nodes=2:ppn=8,walltime=04:00:00,pmem=4000MB
#PBS -j oe
#PBS -V

EXTRA_FILES=
ABINIT=$HOME/test/bin/abinit
MPIEXEC="$HOME/local/bin/mpiexec-osc"
#export LD_LIBRARY_PATH=$HOME/test/lib64:$HOME/test/lib:$LD_LIBRARY_PATH

cd $PBS_O_WORKDIR


# Temporary directory
export TMPDIR="$HOME/ASync002_scratch"
# Work directory
export WORKDIR="$HOME/ASync002_scratch/${PBS_JOBID}-abinit-${PBS_JOBNAME}"
mkdir $WORKDIR

# Copy ab.files and the input file into the work directory and go there.
cp ab.files *.in $EXTRA_FILES $WORKDIR
cd $WORKDIR

# Modify ab.files so that the tmp files are always local - PBS
# or mpiexec creates a directory /tmp/$PBS_JOBID on each node.
sed -i "s_>REPLACE<_/tmp/${PBS_JOBID}/tmp_" ab.files

# Make sure we explicitly set OMP_NUM_THREADS or who knows
# what OpenMP nightmares will happen.
export OMP_NUM_THREADS=1

# Get our nodefile and write it to this directory for reference.
cp $PBS_NODEFILE .

# Now run (hopefully PBS gives us the right nodes!)
$MPIEXEC $ABINIT < ab.files >& log

# Move the job folder to the original submission dir.
cd ..
mv $WORKDIR $PBS_O_WORKDIR

Re: Abinit 6.12.2 hanging on WFK write with MPI-IO

Posted: Thu Mar 29, 2012 2:52 am
by kaneod
Well, I'll post a self-response on this: it seems that even with paral_kgb=1, where the code warns that I shouldn't be able to do input/output, I get proper WFK files by compiling Abinit without MPI-IO. So for now I'm just running without it; performance isn't too bad.
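For reference, the only change relative to the build .ac file above is switching off the MPI-IO option; everything else stays the same:

Code: Select all

enable_mpi="yes"
enable_mpi_io="no"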

...

I'm sorry I can't shed more light on this, as there do seem to be problems writing WFK files in a variety of circumstances (do a WFK search on the forums), and they don't seem to be solved in 6.12.2 for some people. Is there anything the developers would like me to do to help check this in more detail?

Kane