
[SOLVED] screening calculation terminates with too many procs

Posted: Sat Jan 19, 2013 11:47 pm
by raul_l
I'm not sure whether the following problem is related to abinit, Intel MPI, or something else. I thought I'd ask here first and then consult the cluster administrators if necessary.
I'm trying to do a screening calculation using the following input:

Code: Select all

# CdWO4 - geometry optimized for WC06 10/01/13-gw

acell 9.5002266139 11.166582897 9.6076647657
angdeg 90 91.1744636516631 90
ntypat 3
znucl 48 74 8
natom 12
typat 1 1 2 2 3 3 3 3 3 3 3 3
xred
       0.500000000000      0.310555555482      0.750000000000
      -0.500000000000     -0.310555555482     -0.750000000000
       0.000000000000      0.176323504801      0.250000000000
       0.000000000000     -0.176323504801     -0.250000000000
       0.245393128716      0.370110158446      0.385018997706
      -0.245393128716     -0.370110158446     -0.385018997706
       0.245393128716     -0.370110158446      0.885018997706
      -0.245393128716      0.370110158446     -0.885018997706
       0.201581562187      0.096340377830      0.945468039038
      -0.201581562187     -0.096340377830     -0.945468039038
       0.201581562187     -0.096340377830      1.445468039038
      -0.201581562187      0.096340377830     -1.445468039038

ngkpt 3*2
nshiftk 1
shiftk 3*0
nband 2874 # 58 vb, 2816 cb

gwcomp 1 gwencomp 7

ecut 17
pawecutdg 60
ecuteps 8

gwpara 2

optdriver 3
# gwcalctyp 2

istwfk *1
symchi 1

inclvkb 2

With 256 processors the calculation runs for a few minutes but then ends with

Code: Select all

 Calculation status (      8 to be completed):
-P-0000  ik =    1 /    8 is =  1 done by processor   0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

If I reduce the number of bands to 1850, the calculation gets a bit further but then ends with

Code: Select all

 Calculation status (      8 to be completed):
-P-0000  ik =    1 /    8 is =  1 done by processor   0
-P-0000  ik =    2 /    8 is =  1 done by processor   0
-P-0000  ik =    3 /    8 is =  1 done by processor   0
-P-0000  ik =    4 /    8 is =  1 done by processor   0
-P-0000  ik =    5 /    8 is =  1 done by processor   0
-P-0000  ik =    6 /    8 is =  1 done by processor   0
-P-0000  ik =    7 /    8 is =  1 done by processor   0
-P-0000  ik =    8 /    8 is =  1 done by processor   0
[0:klots11] unexpected disconnect completion event from [176:klots5]
[240:klots21] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 240
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 0
[16:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 16
[32:klots13] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 32
[48:klots14] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 48
[64:klots15] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 64
[80:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 80
[96:klots17] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 96
[112:klots18] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 112
[128:klots19] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 128
[144:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 144
[160:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 160
[208:klots8] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 208
[166:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 166
[92:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 92
[148:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 148
[24:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 24
[165:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 165
[proxy:0:1@klots12] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:1@klots12] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:9@klots20] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:9@klots20] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:10@klots4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:10@klots4] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:7@klots18] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:7@klots18] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[mpiexec@klots11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@klots11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@klots11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@klots11] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

With 128 processors the calculation runs successfully. With 256 processors but gwcomp set to 0, the calculation also runs successfully. If I split the input over q-points (nqptdm, qptdm) and run the two parts with 128 processors each, one of them ends with an error similar to the one above, while the other seems to run successfully (the split is sketched below). I don't know where to look for the mistake. Does this look like a bug in abinit? Any ideas would be welcome.
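For completeness, the q-point split was set up roughly like this (only a sketch; the q-point coordinates below are placeholders and have to be replaced by the list printed in the output of the full screening run):

Code: Select all

# one of the two partial runs; placeholder q-points for illustration only
nqptdm  4
qptdm
   0.0  0.0  0.0
   0.5  0.0  0.0
   0.0  0.5  0.0
   0.0  0.0  0.5
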
Abinit-7.0.4 is configured with

Code: Select all

--enable-mpi FC=mpiifort CC=mpiicc CXX=mpiicpc \
--with-dft-flavor=libxc+wannier90 \
--enable-64bit-flags="yes" --with-linalg-flavor=mkl \
--with-fft-flavor=fftw3 \
--with-fft-incs="-I$HOME/local/fftw/include" \
--with-fft-libs="-L$HOME/local/fftw/lib -lfftw3"

This is my ulimit -a:

Code: Select all

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385703
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 300
virtual memory          (kbytes, -v) 4000000
file locks                      (-x) unlimited

We use Torque as the resource manager. The Intel compilers are version 12.1.0 (build 20110811).
Thanks

Re: screening calculation terminates with too many processor

Posted: Sun Jan 20, 2013 8:23 pm
by Alain_Jacques
Hi Raul,

As a first step, I would check whether you bumped against the maximum walltime allowed by Torque on the queue selected for your jobs, or exceeded the maximum memory allocated. The unexpected disconnect is a classic message when Torque kills a job for insufficient resources.
Next, I would study the optimum number of parallel slots as reported by prt_kgb - are you sure about your 128/256 slots?

BTW, if you select --with-linalg-flavor=mkl without specifying how to link it, I doubt configure detected the right recipe; it probably defaulted to the netlib fallback.
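Something along these lines usually does the trick for a sequential MKL (just a sketch - the exact library names depend on your MKL version and on whether you want a threaded MKL):

Code: Select all

--with-linalg-flavor=mkl \
--with-linalg-incs="-I$MKLROOT/include" \
--with-linalg-libs="-L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"
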

Kind regards,

Alain

Re: screening calculation terminates with too many processor

Posted: Sun Jan 20, 2013 11:16 pm
by raul_l
Thank you for the reply.
The maximum walltime of the queue is 400 hours, while my calculation ran for only a few minutes. This is the Torque summary of the calculation:

Code: Select all

Execution terminated
Exit_status=0
resources_used.cput=01:25:42
resources_used.mem=33177740kb
resources_used.vmem=40725464kb
resources_used.walltime=00:06:14

Also, I don't think memory consumption is high enough to hit any limit. I'm using 16 nodes with about 40 GB of memory available on each node. I emailed the administrators about my problem and am waiting for their reply.

What is prt_kgb? I don't know if 256 is optimal, but with 826 bands (768 conduction bands) I have tried 64, 128, and 256 slots: 64 takes almost twice as long as 128, and 128 is about 1.4 times slower than 256.

I think MKL is linked just fine. At the end of configure it says

Code: Select all

* LINALG flavor = mkl (libs: auto-detected)

Also, inspecting config.log shows that abinit has found all the correct libraries.

Re: screening calculation terminates with too many processor

Posted: Tue Jan 22, 2013 10:02 am
by raul_l
It turns out it was memory after all. Torque's 'vmem' is supposed to be the total memory used, but for some reason it only reports the memory consumption of a single node. As I said, the limit is about 40 GB per node, which explains it. I knew GW calculations were expensive, but I didn't imagine I would need over 600 GB of memory for this!
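Roughly, assuming the memory use was about the same on every node:

Code: Select all

vmem reported by Torque (one node):  40725464 kB  ~  40.7 GB
16 nodes x ~40.7 GB                  ~  650 GB in total
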

Re: [SOLVED]screening calculation terminates with too many p

Posted: Tue Jan 22, 2013 4:28 pm
by gmatteo
gwmem 10

drastically decreases the memory requirements of your run at the price of more FFTs to execute.
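That is, simply add the line to the screening input, e.g. (a minimal sketch based on the input posted above):

Code: Select all

optdriver 3
gwmem     10   # less memory kept in core, more FFTs on the fly
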

Re: [SOLVED]screening calculation terminates with too many p

Posted: Thu Feb 21, 2013 10:57 pm
by raul_l
I tried gwmem 0, but saw only a 3% drop in memory usage, which is still too much. Execution time increased by about 5%, so gwmem seems to have very little effect here. I also tried mkmem 0, but got

Code: Select all

 Subroutine Unknown:0:ERROR
 mkmem=0 not yet implemented.

I've tried many different variables, but nothing seems to reduce the memory footprint. Perhaps the system is simply too complex and further memory reduction is not possible. I have 12 atoms in the unit cell, including Cd and W, each of which possesses 8 partial waves. ecuteps seems to be converged to within 50 meV at 9 Ha.
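For what it's worth, a rough back-of-the-envelope estimate, assuming chi0 is stored as a double-precision complex npweps x npweps matrix per frequency and per q-point (the npweps value below is only a guess for illustration):

Code: Select all

memory(chi0) ~ npweps^2 x nomega x 16 bytes        (per q-point)
e.g. npweps ~ 5000, nomega = 2   ->   ~0.8 GB
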