We're getting an MPI-IO hang when MPI_FILE_WRITE_ALL is called on line 229 of the wffreadwrite_mpio.F90 routine in Abinit 6.12.3. The file being written is on a GPFS file system, and we only see the hang when we run across multiple nodes of our InfiniBand-connected cluster. The problem has been difficult to isolate so far, so I've included an input file below that causes the hang and one that doesn't.
We've found this occurs with both OpenMPI (1.4.5) and MVAPICH2 (1.8). One of the cluster sysadmins has run the ROMIO tests, and the drivers and so on appear to be fine. strace shows the hang as repeated poll() calls, but that's about as far as we can get ourselves, as I'm not very experienced with valgrind.
We build with gfortran 4.6, FFTW3 and OpenBLAS.
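In case it helps anyone reproduce this outside Abinit, below is a minimal sketch of the kind of collective write involved. To be clear, this is not the Abinit code: the program name, file name, block size and the simple one-contiguous-block-per-rank layout are my own, and wffreadwrite_mpio.F90 writes a more complicated record structure. It just opens a shared file, gives each rank its own byte offset via MPI_FILE_SET_VIEW, and then issues the same MPI_FILE_WRITE_ALL collective.
Code:
program mpiio_write_test
  ! Minimal MPI-IO collective-write test (a sketch, not the Abinit routine):
  ! every rank writes one contiguous block of doubles to a shared file.
  use mpi
  implicit none

  integer, parameter :: n = 1024                 ! doubles per rank, arbitrary
  integer :: ierr, rank, nproc, fh
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp
  double precision :: buf(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

  buf = dble(rank)                               ! dummy data

  ! (ROMIO hints such as "romio_ds_write" or "romio_cb_write" could be passed
  !  through an MPI_Info object here instead of MPI_INFO_NULL.)
  call MPI_FILE_OPEN(MPI_COMM_WORLD, 'mpiio_test.dat', &
 &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  ! Each rank sees the file starting at its own byte offset.
  disp = int(rank, kind=MPI_OFFSET_KIND) * n * 8
  call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, &
 &     MPI_DOUBLE_PRECISION, 'native', MPI_INFO_NULL, ierr)

  ! The same collective call that hangs for us inside Abinit.
  call MPI_FILE_WRITE_ALL(fh, buf, n, MPI_DOUBLE_PRECISION, status, ierr)

  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FINALIZE(ierr)
end program mpiio_write_test
If something this trivial also hangs across two nodes on GPFS, that would point at ROMIO/GPFS rather than Abinit; if it runs fine, the problem is more likely in how the wavefunction data is laid out and gathered before the write.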
A case that hangs:
Code:
# porphin molecule
# Setting this to 1 will cause the job to hang, 0 to not hang!
prtwf 1
# Parallelization - set for 16 processors. Use over at least two nodes to see the hang.
paral_kgb 1
npkpt 1
npband 8
npfft 2
bandpp 2
# SCF parameters
ecut 4.0
pawecutdg 12.0
tolvrs 1.0d-1 # deliberately bad to quickly get to the output stage outwf.F90
nstep 200
istwfk 2
occopt 7
tsmear 0.02
diemix 0.33
diemac 2.0
# Kpoints
kptopt 1
ngkpt 1 1 1
nshiftk 1
shiftk
0.0 0.0 0.0
nband 80
# Geometry
acell 24.0 24.0 14.0 angstrom
natom 38
ntypat 3
znucl 6 7 1
typat 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3
3 3 3 3 3 3
xangst
14.43267826697125 9.54719469654066 6.98679555192892
9.56971890234498 9.55740143113795 6.98136687410452
9.57830669996704 14.45323690496433 6.99729471155192
14.44063286032851 14.44280830865907 7.00065218638275
13.09053535559251 9.13328020797658 6.98306316607677
12.67923156052460 7.73168327773270 6.97936199186946
11.31569250083367 7.73433869282024 6.97757649970769
10.90999275359248 9.13761393307977 6.98028668551213
9.09554011542721 10.87314035126708 6.98561843864410
7.72903461616869 11.31936915287400 6.98735375088374
7.73139760192924 12.69760130406586 6.99113573510774
9.09938031597771 13.13925447483067 6.99243851467787
10.92010300631924 14.86758534532777 7.00085421660784
11.33151776548370 16.26917592327966 7.00566526340061
12.69511949460426 16.26616429625911 7.00672787336548
13.10041479027480 14.86264378806152 7.00242954254950
14.91526145037882 13.12729369433697 6.99679680639886
16.28188688560975 12.68157456405687 6.99612434170756
16.28001488882063 11.30329858626045 6.99265482691192
14.91210592507385 10.86110781146765 6.99071674837176
14.13078870987637 11.99535250594183 6.99326588288911
9.88047357875540 12.00482853795923 6.98882891763749
12.00192233673013 9.96929220396100 6.98329443280378
12.00837527590531 14.03130874233860 6.99948340730862
15.20432290532285 15.22540747831387 7.00200372856651
15.19313884171311 8.76129846997055 6.98744336798576
8.80561700326767 8.77507783974137 6.97949471039301
8.81771718378007 15.23885330665432 6.99845713352080
13.37257765050554 17.11992092452597 7.01033475042726
10.65751847151932 17.12561065649051 7.00718932911153
17.14387397433472 13.34609132569375 6.99830189384125
17.14042276570114 10.63682513020204 6.99118969243677
13.35278112781582 6.87498761980712 6.97878434452385
10.63876867003055 6.88026233515408 6.97553810680641
6.87131218204644 13.36438702790796 6.99316119368211
6.86695286671293 10.65501172582219 6.98633580459881
10.90483155436783 12.00301405431957 6.99005188546697
13.10645914541739 11.99653337020754 6.99332769226049
A case that doesn't hang:
Code:
# Simple diamond
# Again, settings are low to skip to the important hang part.
# SCF/Paral - set for 16 procs over 2 nodes here. Note that on my system
# if I set bandpp 2 or 4, I get some nasty MPI errors also (a problem to
# look forward to!).
paral_kgb 1
npkpt 2
npband 8
npfft 2
bandpp 1
nband 32
ecut 6.0
pawecutdg 15.0
toldfe 1.0d-2
# Cell geometry
acell 3.57 3.57 3.57 angstrom
natom 2
ntypat 1
typat 1 1
znucl 6
rprim
0.0 0.5 0.5
0.5 0.0 0.5
0.5 0.5 0.0
xred
0.0 0.0 0.0
0.25 0.25 0.25
# MP sampling
kptopt 1
ngkpt 2 2 2
nshiftk 4
shiftk
0.5 0.5 0.5
0.5 0.0 0.0
0.0 0.5 0.0
0.0 0.0 0.5
I believe this hang may be at least one of the reasons people have reported MPI-IO problems. We're continuing the analysis here and will post updates as we find anything.