While running abinit, my larger jobs consistently exit early (either during the first iteration or before it), just after the following message:
"-P-0000 leave_test : synchronization done...
-P-0000 leave_test : exiting..."
This happened when I went from a 2x2 to a 3x3 unit cell of graphene, and again when I increased the vacuum spacing on a silver slab, giving a large unit cell. I would have suspected a memory usage issue, but the problem doesn't go away when I increase the number of nodes and processors (and thus the available memory). Am I doing something dumb? I've attached the input, output, and log files for the graphene unit cell where this happens (labelled graphod.* for my own naming reasons). I queued this job on 2 nodes with 20GB on each node.
Job exiting after P000
Moderator: bguster
Attachments:
- graphod.in (871 Bytes)
- graphod.log (16.9 KiB)
- graphod.out (3.05 KiB)
Re: Job exiting after P000
I'm so sorry, this is clearly in the wrong forum. I meant to place it in the Input file or Platform specific forum. I'm running these jobs on carver.nersc.gov, one of the Lawrence Berkeley National Lab computers, using abinit 6.0.3.
Re: Job exiting after P000
Hello JEJohns,
In the latest versions (6.2 I think, and certainly the upcoming 6.4) I have made this error message more verbose. It arises because some of your processors are not responding. This can be due to:
1) Some of them didn't manage to allocate memory. Check the individual *LOG* files for each one. This is the most common reason.
2) You have chosen a processor distribution which is not consistent (e.g. more processors than you have k-points), in which case some processors end up empty and complain.
3) Some other, real, error on the nodes. If possible, check which nodes are complaining.
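As a rough illustration of checks 1) and 2), here is a minimal Python sketch. It is not part of abinit: the function names `scan_logs` and `idle_processes` are hypothetical, the keyword list is only a guess at what a failed allocation might look like in a log (the real wording may differ between versions), and the idle-process count assumes each k-point is handled by at most one process.

```python
import glob
import re

# Assumed keywords only: the actual messages written by abinit may differ.
SUSPECT = re.compile(r"error|allocat|memory", re.IGNORECASE)


def scan_logs(pattern="*LOG*"):
    """Return {filename: [suspicious lines]} for each per-process log file.

    Files with no matching line are left out of the result.
    """
    hits = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as fh:
            matches = [line.rstrip("\n") for line in fh if SUSPECT.search(line)]
        if matches:
            hits[path] = matches
    return hits


def idle_processes(n_procs, n_kpoints):
    """Number of processes left without a k-point.

    Assumes plain k-point parallelism, where each k-point goes to
    at most one process.
    """
    return max(0, n_procs - n_kpoints)
```

Calling `scan_logs()` in the run directory lists the log files worth opening first, and `idle_processes(16, 9)` returns 7, meaning seven of sixteen processes would have no k-point to work on under the stated assumption.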
Matthieu
Matthieu Verstraete
University of Liege, Belgium
Re: Job exiting after P000
Thanks for the reply. Unfortunately, I've switched positions and am now working at Northwestern & argon, but I hear that they upgraded carver.nersc.gov to 6.2.1, so I'll pass the info along. Thanks
James