
Job exiting after P000

Posted: Thu Aug 05, 2010 11:15 pm
by JEJohns
While running abinit, my larger jobs consistently exit early (either during the first iteration or before it), just after the following message:

"-P-0000 leave_test : synchronization done...
-P-0000 leave_test : exiting..."

This happened when I went from a 2x2 unit cell of graphene to a 3x3 unit cell of graphene, and when I increased the vacuum spacing on a silver slab to a large unit cell. I would have thought it was due to a memory usage issue, but it doesn't go away if I increase the number of nodes & processors (& thus the available memory). Am I doing something dumb? I've attached the input, output, and log files for the graphene unit cell where this happens (labelled as graphod.* for my own personal naming reasons). I've queued this job on 2 nodes with 20GB on each node.

Re: Job exiting after P000

Posted: Thu Aug 05, 2010 11:22 pm
by JEJohns
I'm so sorry, this is clearly in the wrong forum. I meant to place it in the Input file or Platform specific forum. I'm running these jobs on one of the Lawrence Berkeley National Lab computers using abinit 6.0.3.

Re: Job exiting after P000

Posted: Sat Sep 04, 2010 9:31 am
by mverstra
Hello JEJohns,

In the latest versions (6.2 I think, and certainly the upcoming 6.4) I have made this error message more verbose. It arises because some of your processors are not responding. This can be due to:

1) some of them didn't manage to allocate memory. Check the individual *LOG* files for each one. This is the most common reason.
2) you have chosen a processor distribution which is not consistent (e.g. more processors than you have k-points), in which case some processors end up empty and complain.
3) other, real, error on the nodes. If possible check which nodes are complaining.
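
The first two checks can be roughed out in a small Python script. This is only a sketch under assumptions: the `*LOG*` glob follows the file naming mentioned above, the search keywords are guesses rather than ABINIT's exact message text, and `scan_logs` and `idle_processors` are hypothetical helper names, not part of ABINIT.

```python
import glob
import os
import re

# Assumed keywords: the exact wording of the messages may differ between versions.
KEYWORDS = re.compile(r"error|allocat|memory", re.IGNORECASE)


def scan_logs(pattern="*LOG*"):
    """Return {filename: [matching lines]} for each regular file matching the glob."""
    hits = {}
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories that happen to match the pattern
        with open(path, errors="replace") as fh:
            matches = [line.rstrip("\n") for line in fh if KEYWORDS.search(line)]
        if matches:
            hits[path] = matches
    return hits


def idle_processors(nproc, nkpt):
    """Processors left without a k-point, assuming work is split over k-points only."""
    return max(0, nproc - nkpt)
```

Calling `scan_logs()` in the run directory lists the per-processor log files that mention an error or an allocation problem, and `idle_processors(nproc, nkpt)` gives a quick sanity check of the requested processor count against the number of k-points reported in the output file.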


Re: Job exiting after P000

Posted: Wed Sep 08, 2010 7:22 am
by JEJohns
Thanks for the reply. Unfortunately, I've switched positions and am now working at Northwestern & Argonne, but I hear that they upgraded to 6.2.1, so I'll pass the info along. Thanks.