I'm trying to do a screening calculation using the following input:
# CdWO4 - geometry optimized for WC06 10/01/13-gw
acell 9.5002266139 11.166582897 9.6076647657
angdeg 90 91.1744636516631 90
ntypat 3
znucl 48 74 8
natom 12
typat 1 1 2 2 3 3 3 3 3 3 3 3
xred
0.500000000000 0.310555555482 0.750000000000
-0.500000000000 -0.310555555482 -0.750000000000
0.000000000000 0.176323504801 0.250000000000
0.000000000000 -0.176323504801 -0.250000000000
0.245393128716 0.370110158446 0.385018997706
-0.245393128716 -0.370110158446 -0.385018997706
0.245393128716 -0.370110158446 0.885018997706
-0.245393128716 0.370110158446 -0.885018997706
0.201581562187 0.096340377830 0.945468039038
-0.201581562187 -0.096340377830 -0.945468039038
0.201581562187 -0.096340377830 1.445468039038
-0.201581562187 0.096340377830 -1.445468039038
ngkpt 3*2
nshiftk 1
shiftk 3*0
nband 2874 # 58 vb, 2816 cb
gwcomp 1 gwencomp 7
ecut 17
pawecutdg 60
ecuteps 8
gwpara 2
optdriver 3
# gwcalctyp 2
istwfk *1
symchi 1
inclvkb 2
With 256 processors the calculation runs for a few minutes but then ends with:
Calculation status ( 8 to be completed):
-P-0000 ik = 1 / 8 is = 1 done by processor 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
If I reduce the number of bands to 1850, the calculation gets a bit farther but then ends with:
Calculation status ( 8 to be completed):
-P-0000 ik = 1 / 8 is = 1 done by processor 0
-P-0000 ik = 2 / 8 is = 1 done by processor 0
-P-0000 ik = 3 / 8 is = 1 done by processor 0
-P-0000 ik = 4 / 8 is = 1 done by processor 0
-P-0000 ik = 5 / 8 is = 1 done by processor 0
-P-0000 ik = 6 / 8 is = 1 done by processor 0
-P-0000 ik = 7 / 8 is = 1 done by processor 0
-P-0000 ik = 8 / 8 is = 1 done by processor 0
[0:klots11] unexpected disconnect completion event from [176:klots5]
[240:klots21] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 240
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 0
[16:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 16
[32:klots13] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 32
[48:klots14] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 48
[64:klots15] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 64
[80:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 80
[96:klots17] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 96
[112:klots18] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 112
[128:klots19] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 128
[144:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 144
[160:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 160
[208:klots8] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 208
[166:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 166
[92:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 92
[148:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 148
[24:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 24
[165:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 165
[proxy:0:1@klots12] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:1@klots12] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:9@klots20] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:9@klots20] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:10@klots4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:10@klots4] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:7@klots18] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:7@klots18] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[mpiexec@klots11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@klots11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@klots11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@klots11] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
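One thing that stands out in the log above: every "unexpected disconnect" line names rank 176 on klots5 as the peer that went away, so that process (or node) is probably the first one to die. Below is a minimal Python sketch for tallying this from a saved log. The function name `summarize_disconnects` and the short `sample` excerpt are only illustrative; they are not part of abinit or of any MPI tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# [0:klots11] unexpected disconnect completion event from [176:klots5]
DISCONNECT = re.compile(
    r"\[(\d+):(\w+)\] unexpected disconnect completion event from \[(\d+):(\w+)\]"
)

def summarize_disconnects(log_text):
    """Count how often each (rank, host) is named as the source of a disconnect.

    Returns (sources, reporters): a Counter keyed by the (rank, host) that
    disappeared, and the set of (rank, host) pairs that reported it.
    """
    sources = Counter()
    reporters = set()
    for line in log_text.splitlines():
        match = DISCONNECT.search(line)
        if match:
            reporters.add((int(match.group(1)), match.group(2)))
            sources[(int(match.group(3)), match.group(4))] += 1
    return sources, reporters

# Short excerpt of the log above, used here only as sample input.
sample = """\
[0:klots11] unexpected disconnect completion event from [176:klots5]
[240:klots21] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 240
[16:klots12] unexpected disconnect completion event from [176:klots5]
"""

sources, reporters = summarize_disconnects(sample)
for (rank, host), count in sources.most_common():
    print(f"rank {rank} on {host}: named in {count} disconnect events")
```

If a single rank or host dominates the tally, its own stderr and the system log on that node are the first places to look for the original failure.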
With 128 processors the calculation runs successfully. With 256 processors but gwcomp set to 0, it also runs successfully. If I split the input over q-points (nqptdm, qptdm) and run both calculations with 128 processors, one of them ends with an error similar to the one above, while the other seems to run successfully. I don't know where to look for the mistake. Does this look like a bug in abinit? Any ideas would be welcome.
Abinit-7.0.4 is configured with:
--enable-mpi FC=mpiifort CC=mpiicc CXX=mpiicpc \
--with-dft-flavor=libxc+wannier90 \
--enable-64bit-flags="yes" --with-linalg-flavor=mkl \
--with-fft-flavor=fftw3 \
--with-fft-incs="-I$HOME/local/fftw/include" \
--with-fft-libs="-L$HOME/local/fftw/lib -lfftw3"
This is my ulimit -a:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 385703
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 300
virtual memory (kbytes, -v) 4000000
file locks (-x) unlimited
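For reference, here is a small Python sketch that converts the virtual memory limit above into GiB. It assumes the "kbytes" reported by ulimit are units of 1024 bytes, and that there are 16 MPI ranks per node, which is only what the rank numbering in the error log suggests (e.g. ranks 16 and 24 both on klots12). The helper name `kb_to_gib` is hypothetical.

```python
def kb_to_gib(kilobytes):
    """Convert a size in units of 1024 bytes to GiB."""
    return kilobytes * 1024 / 1024**3

virtual_memory_kb = 4000000  # "virtual memory (kbytes, -v)" from the listing above
ranks_per_node = 16          # assumption inferred from the rank-to-host pairs in the log

per_process_gib = kb_to_gib(virtual_memory_kb)
print(f"limit per process: {per_process_gib:.2f} GiB")
print(f"upper bound per node: {per_process_gib * ranks_per_node:.1f} GiB")
```

So each process can address a bit under 4 GiB of virtual memory before hitting the limit, whatever the physical memory of the node is.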
We use Torque as the resource manager. The Intel compilers are version 12.1.0 20110811.
Thanks