paral_kgb crash on 130 processors

Post by asorini » Thu May 27, 2010 12:32 am

Hi Forum,

I am using abinit 6.0.3 and Open MPI 1.4.1 (both compiled with ifort 11) for parallel calculations of a large molecule near an Au(111) surface. I have been running into some problems and was hoping to get advice.

For one thing, the crashes are not reproducible: sometimes a parallel job makes it through about 100 SCF cycles before crashing, and when I run the exact same job again it crashes after only 10 SCF cycles. Any idea what the problem could be in this case?

As a more specific problem, I have recently been trying to obtain the WFK file for a large molecule near an Au(111) surface for use with the STM capabilities of abinit, but my parallel runs on 130 processors keep crashing after only a few SCF iterations. I have attached the input file and the tail of the log file. I also include the error message sent by LSF below:

Job <pam -g 1 mympirun_wrapper abinit < gold4x5whybrid.files >& log> was submitted from host <simes0001> by user <asorini>.
Job was executed on host(s) <8*simes0011>, in queue <simesq>, as user <asorini>.
<8*simes0044>
<8*simes0032>
<8*simes0039>
<8*simes0053>
<8*simes0018>
<8*simes0064>
<8*simes0008>
<8*simes0014>
<8*simes0063>
<8*simes0048>
<8*simes0002>
<8*simes0059>
<8*simes0030>
<8*simes0022>
<8*simes0020>
<2*simes0027>
</u/xl/asorini> was used as the home directory.
</nfs/slac/g/simes/asorini/Gold4x5wHybrid/lessgoldlayers/STM_Stuff/again> was used as the working directory.
Started at Tue May 25 20:20:17 2010
Results reported at Wed May 26 10:16:10 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
pam -g 1 mympirun_wrapper abinit < gold4x5whybrid.files >& log
------------------------------------------------------------

Exited with exit code 134.

Resource usage summary:

CPU time :14710985.00 sec.
Max Memory : 253466 MB
Max Swap : 561752 MB

Max Processes : 130
Max Threads : 130

The output (if any) follows:

/nfs/farm/lsb_spool/1274839309.199744: line 8: 10628 Aborted (core dumped) pam -g 1 mympirun_wrapper abinit <gold4x5whybrid.files >&log
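
For what it is worth, the exit code 134 above is consistent with the "Aborted (core dumped)" message: shells conventionally report 128 plus the signal number when a process is killed by a signal, and 134 - 128 = 6, which is SIGABRT on Linux. A minimal sketch of that decoding in Python, assuming a Linux signal table (the helper name `decode_exit_code` is made up for illustration and is not part of abinit or LSF):

```python
import signal

def decode_exit_code(code):
    """Map a shell exit status to a signal name, assuming the 128+N convention.

    Returns None if the status does not look like a signal-terminated process.
    """
    if code > 128:
        try:
            # Signal numbers are platform-specific; 6 is SIGABRT on Linux.
            return signal.Signals(code - 128).name
        except ValueError:
            return None
    return None

print(decode_exit_code(134))
```

This only identifies how the process ended (an abort, typically raised from inside the program or a library), not why it aborted.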

Any help with this would be very much appreciated. Cheers,

Adam
Attachments
gold4x5whybrid.in (input file, 8.52 KiB)
log.log (log file, 47.86 KiB)