Bug in ncache assignment for MPI FFTs
Posted: Mon Nov 19, 2012 6:46 pm
Hi all,
one of our users here at the Irish Centre for High-End Computing ran into an error while running Abinit 6.12.3. The error message printed right before execution stopped was:
ncache has to be enlarged to be able to hold at
least one 1-d FFT of each size even though this will
reduce the performance for shorter transform lengths
I was able to reproduce his 'ncache' issue and trace it to the FFT routines in abinit-6.12.3/src/52_fft_mpi_noabirule. The variable 'ncache' defines the size of the working area for the FFT algorithm, and the program stops if that work array is too small to hold even a single one-dimensional transform. The problem is that the value of 'ncache' is hardcoded, so execution cannot proceed if any of the FFT dimensions exceeds 1024, which is the situation the user faced. The piece of code that caused the unexpected termination is the following:
Code:
ncache=4*1024
if (ncache/(4*max(n1,n2,n3)).lt.1) then
  write(std_out,*) &
&   ' ncache has to be enlarged to be able to hold at', &
&   ' least one 1-d FFT of each size even though this will', &
&   ' reduce the performance for shorter transform lengths'
  stop
end if
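To make the 1024 threshold concrete, here is a minimal standalone sketch of the same integer-division test. The program name and the three sample sizes are made up for illustration and are not part of Abinit; each sample value stands in for max(n1,n2,n3).

```fortran
program ncache_check_demo
  implicit none
  ! Hardcoded value taken from the 6.12.3 sources quoted above
  integer, parameter :: ncache = 4*1024
  integer :: sizes(3), i, nmax

  ! Hypothetical largest FFT dimensions (illustrative values only)
  sizes = (/ 512, 1024, 1025 /)

  do i = 1, size(sizes)
    nmax = sizes(i)
    ! Integer division truncates, e.g. 4096/(4*1025) = 4096/4100 = 0
    if (ncache/(4*nmax) < 1) then
      write(*,'(a,i6,a)') ' max dimension ', nmax, ': check fails, run would stop'
    else
      write(*,'(a,i6,a)') ' max dimension ', nmax, ': check passes'
    end if
  end do
end program ncache_check_demo
```

Since the quotient is an integer, the check passes up to and including a largest dimension of 1024 and fails for anything larger.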
To work around this issue, I patched Abinit so that the work array for the FFT algorithm is sized according to the FFT dimensions. In the files accrho.F90, applypot.F90, back.F90, back_wf.F90, forw.F90 and forw_wf.F90, I replaced every assignment of 'ncache':
Code:
ncache=4*1024
with the following, so that the working area for the FFT algorithm can hold at least one one-dimensional transform:
Code:
ncache=4*max(n1,n2,n3,1024)
Where 'n1', 'n2' and 'n3' are the FFT dimensions.
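As a quick sanity check of the patched assignment, the following standalone sketch applies it to a made-up grid with one dimension above 1024 and then runs the same test as the FFT routines. The program name and the grid values are assumptions for illustration only.

```fortran
program ncache_patch_demo
  implicit none
  integer :: n1, n2, n3, ncache

  ! Hypothetical FFT grid with one dimension larger than 1024
  n1 = 1200
  n2 = 96
  n3 = 96

  ! Patched assignment: never smaller than the original 4*1024
  ncache = 4*max(n1, n2, n3, 1024)

  ! Same test as in the FFT routines; the quotient is now at least 1
  if (ncache/(4*max(n1, n2, n3)) < 1) then
    write(*,*) 'ncache too small'
    stop
  end if

  write(*,*) 'ncache = ', ncache, ' (check passed)'
end program ncache_patch_demo
```

Because 1024 is included in the max(), grids whose dimensions are all at or below 1024 keep the original value of 4*1024, so only larger grids get a bigger working area.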
With Abinit 6.12.3 patched, the code passed the check that had caused the unexpected termination, the job completed successfully and the results were physically consistent.
I would appreciate it if you could fix this bug in Abinit's main trunk.
If you'd like to replicate the bug, I've published the test case scenario here:
http://www-staff.ichec.ie/~rmiceli/abinit/
The Abinit 6.12.3 installation script I used is available here:
http://www-staff.ichec.ie/~rmiceli/abin ... -6.12.3.sh
The data sets, input files and PBS scripts are available here:
http://www-staff.ichec.ie/~rmiceli/abin ... test-case/
(where 01-H.LDA.fhi and 14-Si.LDA.fhi are the pseudopotentials; sislab.files and sislab.in are the input files; sislab.scr is the PBS script for job execution; and sislab.log, sislab.o1209327, sislab.out, sislab_STATUS and sislabo_OUT.nc are the output files after the execution unexpectedly terminates.)
You can also find the patches for the source files at abinit-6.12.3/src/52_fft_mpi_noabirule here:
http://www-staff.ichec.ie/~rmiceli/abin ... g/patches/
And here is the execution script that applies the patches using 'sed':
http://www-staff.ichec.ie/~rmiceli/abin ... patched.sh
Please let me know if you need any further information.
Thank you in advance for your time and patience.
Kind regards,
Renato Miceli