[SOLVED] screening calculation terminates with too many procs

GW, Bethe-Salpeter …

Moderators: maryam.azizi, bruneval

raul_l
Posts: 74
Joined: Sun Jan 08, 2012 7:45 pm

[SOLVED] screening calculation terminates with too many procs

Post by raul_l » Sat Jan 19, 2013 11:47 pm

I'm not sure whether the following problem is related to Abinit, Intel MPI, or something else. I thought I'd ask here first and then consult the cluster administrators if necessary.
I'm trying to do a screening calculation using the following input:

Code: Select all

# CdWO4 - geometry optimized for WC06 10/01/13-gw

acell 9.5002266139 11.166582897 9.6076647657
angdeg 90 91.1744636516631 90
ntypat 3
znucl 48 74 8
natom 12
typat 1 1 2 2 3 3 3 3 3 3 3 3
xred
       0.500000000000      0.310555555482      0.750000000000
      -0.500000000000     -0.310555555482     -0.750000000000
       0.000000000000      0.176323504801      0.250000000000
       0.000000000000     -0.176323504801     -0.250000000000
       0.245393128716      0.370110158446      0.385018997706
      -0.245393128716     -0.370110158446     -0.385018997706
       0.245393128716     -0.370110158446      0.885018997706
      -0.245393128716      0.370110158446     -0.885018997706
       0.201581562187      0.096340377830      0.945468039038
      -0.201581562187     -0.096340377830     -0.945468039038
       0.201581562187     -0.096340377830      1.445468039038
      -0.201581562187      0.096340377830     -1.445468039038

ngkpt 3*2
nshiftk 1
shiftk 3*0
nband 2874 # 58 vb, 2816 cb

gwcomp 1 gwencomp 7

ecut 17
pawecutdg 60
ecuteps 8

gwpara 2

optdriver 3
# gwcalctyp 2

istwfk *1
symchi 1

inclvkb 2

With 256 processors the calculation runs for a few minutes but then ends with

Code: Select all

 Calculation status (      8 to be completed):
-P-0000  ik =    1 /    8 is =  1 done by processor   0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

If I reduce the number of bands to 1850 the calculation gets a bit farther but then ends with

Code: Select all

 Calculation status (      8 to be completed):
-P-0000  ik =    1 /    8 is =  1 done by processor   0
-P-0000  ik =    2 /    8 is =  1 done by processor   0
-P-0000  ik =    3 /    8 is =  1 done by processor   0
-P-0000  ik =    4 /    8 is =  1 done by processor   0
-P-0000  ik =    5 /    8 is =  1 done by processor   0
-P-0000  ik =    6 /    8 is =  1 done by processor   0
-P-0000  ik =    7 /    8 is =  1 done by processor   0
-P-0000  ik =    8 /    8 is =  1 done by processor   0
[0:klots11] unexpected disconnect completion event from [176:klots5]
[240:klots21] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 240
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 0
[16:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 16
[32:klots13] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 32
[48:klots14] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 48
[64:klots15] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 64
[80:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 80
[96:klots17] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 96
[112:klots18] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 112
[128:klots19] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 128
[144:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 144
[160:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 160
[208:klots8] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 208
[166:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 166
[92:klots16] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 92
[148:klots20] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 148
[24:klots12] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 24
[165:klots4] unexpected disconnect completion event from [176:klots5]
Assertion failed in file ../../dapl_conn_rc.c at line 1054: 0
internal ABORT - process 165
[proxy:0:1@klots12] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:1@klots12] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:9@klots20] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:9@klots20] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:10@klots4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:10@klots4] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[proxy:0:7@klots18] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:7@klots18] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[mpiexec@klots11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@klots11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@klots11] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@klots11] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

With 128 processors the calculation runs successfully. With 256 processors but gwcomp set to 0 it also runs successfully. If I split the input over q-points (nqptdm, qptdm; see the snippet below) and run both halves with 128 processors, one of them ends with an error similar to the one above while the other seems to run successfully. I don't know where to look for a mistake. Does this look like a bug in Abinit? Any ideas would be welcome.
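For completeness, the q-point splitting amounts to adding something like the following to each input (the coordinates below are placeholders; the actual list is taken from the q-point report printed by the full screening run):

Code: Select all

# run 1 of 2: first half of the q-points of the 2x2x2 grid
nqptdm 4        # compute only these q-points in this run
qptdm           # placeholder coordinates; use the list printed by the full run
  0.0 0.0 0.0
  0.5 0.0 0.0
  0.0 0.5 0.0
  0.0 0.0 0.5
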
Abinit-7.0.4 is configured with

Code: Select all

--enable-mpi FC=mpiifort CC=mpiicc CXX=mpiicpc \
--with-dft-flavor=libxc+wannier90 \
--enable-64bit-flags="yes" --with-linalg-flavor=mkl \
--with-fft-flavor=fftw3 \
--with-fft-incs="-I$HOME/local/fftw/include" \
--with-fft-libs="-L$HOME/local/fftw/lib -lfftw3"

This is my ulimit -a:

Code: Select all

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 385703
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 300
virtual memory          (kbytes, -v) 4000000
file locks                      (-x) unlimited

We use Torque as the resource manager. The Intel compilers are version 12.1.0 20110811.
Thanks
Raul Laasner
Netherlands Institute for Space Research

Alain_Jacques
Posts: 279
Joined: Sat Aug 15, 2009 9:34 pm
Location: Université catholique de Louvain - Belgium

Re: screening calculation terminates with too many procs

Post by Alain_Jacques » Sun Jan 20, 2013 8:23 pm

Hi Raul,

As a first step, I would check that you didn't bump against the maximum walltime allowed by Torque on the queue selected for your jobs, or exceed the maximum memory allocated. The unexpected disconnect is a classic message when Torque kills a job for insufficient resources.
Next I would study the optimum number of parallel slots as reported by prt_kgb. Are you sure about your 128/256 slots?
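
For instance, with Torque something like this shows what the job consumed against its limits (replace 12345 with your job id):

Code: Select all

qstat -f 12345 | grep -E 'resources_used|Resource_List'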

BTW, if you select --with-linalg-flavor=mkl without specifying how to link it, I doubt configure detected the right recipe; it probably defaulted to the netlib fallback.
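
Something along these lines usually works; the exact MKL library names depend on your installation, so take this as a sketch for a sequential LP64 link rather than a definitive recipe:

Code: Select all

--with-linalg-flavor=mkl \
--with-linalg-incs="-I$MKLROOT/include" \
--with-linalg-libs="-L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"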

Kind regards,

Alain

raul_l
Posts: 74
Joined: Sun Jan 08, 2012 7:45 pm

Re: screening calculation terminates with too many procs

Post by raul_l » Sun Jan 20, 2013 11:16 pm

Thank you for the reply.
The maximum walltime on the queue is 400 hours, while my calculation only ran for a few minutes. This is the Torque summary of the calculation:

Code: Select all

Execution terminated
Exit_status=0
resources_used.cput=01:25:42
resources_used.mem=33177740kb
resources_used.vmem=40725464kb
resources_used.walltime=00:06:14

Also, I don't think memory consumption is high enough to hit any limit. I'm using 16 nodes with about 40 GB of memory available on each node. I emailed the administrators about my problem and am waiting for their reply.

What is prt_kgb? I don't know if 256 is optimal, but using 826 bands (768 CBs) I have tried 64, 128 and 256 slots: 64 takes almost twice as much time as 128, and 128 is about 1.4 times slower than 256.

I think MKL is linked just fine. At the end of configure it says

Code: Select all

* LINALG flavor = mkl (libs: auto-detected)

Also, inspecting config.log shows that Abinit has found all the correct libraries.
Raul Laasner
Netherlands Institute for Space Research

raul_l
Posts: 74
Joined: Sun Jan 08, 2012 7:45 pm

Re: screening calculation terminates with too many procs

Post by raul_l » Tue Jan 22, 2013 10:02 am

It turns out it was memory after all. Torque's 'vmem' is supposed to be the total memory used, but for some reason it only shows the memory consumption of a single node. As I said, about 40 GB per node is the limit, which explains it. I knew GW calculations were expensive, but I didn't imagine I would be using over 600 GB of memory for this!
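For the record, scaling the per-node vmem reported by Torque up to all 16 nodes gives the total:

Code: Select all

# vmem reported for one node: 40725464 kB; the job ran on 16 nodes
echo $((40725464 * 16 / 1024 / 1024)) GB    # prints: 621 GB
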
Raul Laasner
Netherlands Institute for Space Research

gmatteo
Posts: 291
Joined: Sun Aug 16, 2009 5:40 pm

Re: [SOLVED] screening calculation terminates with too many procs

Post by gmatteo » Tue Jan 22, 2013 4:28 pm

gwmem 10

drastically decreases the memory requirements of your run at the price of more FFTs to execute.
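
In the screening input this is a one-line addition, e.g.:

Code: Select all

optdriver 3   # screening run, as in the original input
gwmem 10      # store less in memory, redoing FFTs when needed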

raul_l
Posts: 74
Joined: Sun Jan 08, 2012 7:45 pm

Re: [SOLVED] screening calculation terminates with too many procs

Post by raul_l » Thu Feb 21, 2013 10:57 pm

I tried gwmem 0, but saw only a 3% drop in memory usage, so consumption is still large. Execution time increased by about 5%, so gwmem seems to have very little effect here. I also tried mkmem 0, but got

Code: Select all

 Subroutine Unknown:0:ERROR
 mkmem=0 not yet implemented.

I've tried so many different variables, but nothing seems to reduce the memory. Perhaps the system is just too big and further memory reduction is not possible. I have 12 atoms in the unit cell, including Cd and W, each of which possesses 8 partial waves. ecuteps seems to be converged at 9 Ha to within 50 meV.
Raul Laasner
Netherlands Institute for Space Research
