abinit 6.4.1 + fftw => running time doubled?!

Post by **sbecuwe** » Thu Nov 04, 2010 11:09 am

Hello,

I have installed 6.4.1 using icc 11.1.046, ifort 11.1.046, impi 3.2.1.009, imkl 10.2.1.017 (MKL + Scalapack) on Intel Harpertown architecture,
with the following plugins: ETSFIO 1.0.2, libxc svn-r6071-fixed, Wannier90 1.1, BigDFT 1.2.0.2.

I have made 3 configurations:
- one without FFTW support
- one with FFTW 3.2.2
- one with Intel MKL FFTW

Running times are surprising (for a given problem, on 3 nodes with 2 quad core processors each):

- without FFTW:
Proc. 0 individual time (sec): cpu= 1865.9 wall= 1865.9
Overall time at end (sec) : cpu= 44783.0 wall= 44783.0

- with FFTW 3.2.2:
Proc. 0 individual time (sec): cpu= 3189.2 wall= 3189.2
Overall time at end (sec) : cpu= 76541.7 wall= 76541.7

- with Intel MKL FFTW
Proc. 0 individual time (sec): cpu= 3124.3 wall= 3124.3
Overall time at end (sec) : cpu= 74986.3 wall= 74986.3
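
To put the timings above in numbers, here is a minimal Python sketch that computes the slowdown relative to the build without FFTW. The values are copied from the three runs; the dictionary and variable names are made up for illustration only.

```python
# Timings in seconds, copied from the three runs above.
# Each entry: (Proc. 0 individual time, overall time at end).
timings = {
    "no FFTW": (1865.9, 44783.0),
    "FFTW 3.2.2": (3189.2, 76541.7),
    "MKL FFTW": (3124.3, 74986.3),
}

base_proc0, base_overall = timings["no FFTW"]

# Slowdown of each build relative to the build without FFTW.
slowdown = {name: proc0 / base_proc0 for name, (proc0, _) in timings.items()}

# Overall time divided by Proc. 0 time approximates the number of MPI processes.
nproc_estimate = round(base_overall / base_proc0)

for name, factor in slowdown.items():
    print(f"{name:12s} {factor:.2f}x")
print("estimated number of processes:", nproc_estimate)
```

Both FFTW builds come out at roughly 1.7 times the reference run, and the overall/Proc. 0 ratio is consistent with 24 processes (3 nodes with 8 cores each).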

Is this a known behaviour? On an Opteron cluster with Intel compilers, OpenMPI and ACML, we don't see this (strange) behaviour.

Regards

Stefan

Post by **jzwanzig** » Sat Nov 06, 2010 1:25 pm

Did you check that the FFTW-based build passes all the tests in the test suite? On my machines it does not. The internal FFT codes (from Stefan Goedecker) in abinit are extremely good and I would be surprised if you got significantly better performance with FFTW. On the other hand, as FFTW support is a relatively recent addition and is not, I believe, fully ready for production release yet, I am NOT surprised that it is not always working as well as the internal FFT.

Post by **Alain_Jacques** » Sat Nov 06, 2010 2:25 pm

Not all parts of Abinit benefit from FFTW - there are probably more enhancements in GW for the moment, so it should be considered work in progress. FFT data size has a strong influence. Matteo can confirm that he has seen significant speed improvements in specific cases. So don't expect skyrocketing performance on every test case and size by switching to FFTW.

Stefan - a daft question - did you correctly set the fftalg variable to perform your different benchmarks? Compiling with FFTW doesn't mean that FFTW equivalent routines are automatically invoked. The default is still Goedecker.

A caveat for all FFTW/MKL adopters: FFTW libraries are very often available as multi-threaded builds. Mixing MPI parallelism with multi-threaded libraries often leads to a decrease in performance due to CPU overloading. I would advise linking sequential FFTW3 or MKL libraries with an MPI-parallelized Abinit; multi-threaded FFT (and LAPACK) may only be efficient on very large datasets that cannot be MPI parallelized. Check that you're not creating more threads than available cores during an Abinit run - do a "top", press "H" to see all Abinit threads running at the same time, and if there are more of them than available cores in your box then you're in fact slowing things down due to context switching.
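
The oversubscription check described above boils down to simple arithmetic. Here is a rough Python sketch; the function names are hypothetical, and the thread counts in the example are made-up values, not measurements from this thread.

```python
def total_threads(mpi_ranks_per_node, threads_per_rank):
    """Number of threads competing for cores on one node."""
    return mpi_ranks_per_node * threads_per_rank


def is_oversubscribed(mpi_ranks_per_node, threads_per_rank, cores_per_node):
    """True if more threads are running than there are cores on the node."""
    return total_threads(mpi_ranks_per_node, threads_per_rank) > cores_per_node


# Node with 2 quad-core processors, as in the original post.
cores_per_node = 8

# One MPI rank per core with a sequential FFT library: no oversubscription.
print(is_oversubscribed(8, 1, cores_per_node))

# Same ranks, but each one spawning 4 library threads (hypothetical value).
print(is_oversubscribed(8, 4, cores_per_node))
```

In the second case 32 threads would compete for 8 cores, which is the situation the "top" + "H" check is meant to reveal.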

Kind regards,

Alain

Post by **sbecuwe** » Sun Nov 07, 2010 10:14 pm

Since we did not set fftalg for this particular test, and as a consequence should be using the built-in FFT routines, we did not expect to see any difference whether abinit is linked with or without FFTW or MKL... It seems mysterious.

Regards
Stefan

Post by **gmatteo** » Thu Nov 11, 2010 9:12 pm

The default value for fftalg is set to 312 when FFTW3 support is enabled (see indefo.F90), hence
the last two calculations have been done with the FFTW3 routines instead of the FFT code from Goedecker.
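
As a small illustration of why the default matters, the sketch below maps an fftalg value to the FFT library it selects. It assumes that the leading (hundreds) digit of fftalg chooses the library, and it only covers the two values that appear in this thread (112 and 312); the names are hypothetical and this is not code from Abinit.

```python
# Assumption: the leading digit of fftalg selects the FFT library.
# Only the two cases discussed in this thread are listed.
LIBRARY_BY_LEADING_DIGIT = {
    1: "Goedecker",
    3: "FFTW3",
}


def fft_library(fftalg):
    """Return the library name for an fftalg value, if this sketch knows it."""
    leading_digit = fftalg // 100
    return LIBRARY_BY_LEADING_DIGIT.get(leading_digit, "not covered in this sketch")


for value in (112, 312):
    print(value, "->", fft_library(value))
```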

I made this choice simply because it facilitates the testing of the Abinit+FFTW3 executable on the test farm
we routinely use here in Louvain.

Possible explanations for the degradation in the performance of the code might be:

1) The FFT divisions are not optimal for the FFTW3 library. This usually happens when ngfft(1:3) contains powers of 7, 11 or greater primes.
At present there is no check on the FFT divisions used for FFTW3; we just use the values that are supported by Goedecker's library.

2) FFTW3 doesn't yet support the augmentation of the FFT mesh to reduce cache conflicts.
Cache conflicts might be detrimental especially when ngfft(1:2) is even. What is the value of ngfft used for your tests?
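
Both points can be checked by hand for a given mesh. The following Python sketch factorizes each FFT division, flags prime factors of 7 or larger (my reading of point 1) and reports whether the division is even (point 2). The helper names are hypothetical; 90 and 60 are the divisions that appear in the fftprof output later in this thread, and 77 is an invented counter-example.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with multiplicity."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def check_division(n, largest_small_prime=5):
    """Summarize one FFT division: factors, large primes, and parity."""
    factors = prime_factors(n)
    return {
        "n": n,
        "factors": factors,
        "large_primes": [f for f in factors if f > largest_small_prime],
        "even": n % 2 == 0,
    }


for division in (90, 60, 77):
    print(check_division(division))
```

For 90 and 60 no large prime factors show up, but both divisions are even, so with this mesh point 2 looks like the more relevant candidate.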

Could you post the running times obtained on the opteron cluster?

BTW:
There's a small tool in 98_main named fftprof that can be used to analyze the performance of the different FFT libraries for
given values of ecut and lattice parameters.
The code can be run interactively and it will report the CPU and WALL time spent to perform the FFT with different algorithms
(complex-to-complex, real-to-complex, zero padding FFT ....)
Having some profiling of the FFT machinery for your particular architecture would be extremely helpful to spot the problem.

Best regards,
Matteo

Post by **sbecuwe** » Tue Nov 16, 2010 10:40 am

We had not noticed that the default was changed. Adding fftalg 112 gives the old (better) timings in 6.4.1.

It seems that in our setup fftw3 is slower overall. According to fftprof, fftw3 is faster for all kinds of FFT except "fourdp with cplex 1" (see output below).
So one question is left: do you notice this behaviour on other platforms too, and would it be better to use the original Goedecker routines in this specific case?

Regards
Stefan

 ==== FFT mesh ====
  FFT mesh divisions ........................    90   90   60
  Augmented FFT divisions ...................    91   91   60
  FFT algorithm .............................   112
  FFT cache size ............................    16
 getmpw: optimal value of mpw=   24617
 ==== FFT mesh ====
  FFT mesh divisions ........................    90   90   60
  Augmented FFT divisions ...................    91   91   60
  FFT algorithm .............................   312
  FFT cache size ............................     0
 For input ecut=  1.050000E+02 best grid ngfft=      90      90      60
       max ecut=  1.233701E+02
 ==== FFT mesh ====
  FFT mesh divisions ........................    90   90   60
  Augmented FFT divisions ...................    91   91   60
  FFT algorithm .............................   112
  FFT cache size ............................    16
==== fourdp with cplex 1 ====
                Library        CPU-time  WALL-time  ncalls
    FFTW; R2C; zero-pad+cache   0.072   0.072     10
Goedeker; R2C; zero-pad+cache   0.016   0.016     10
 ==== fourdp with cplex 2 ====
                Library        CPU-time  WALL-time  ncalls
    FFTW; R2C; zero-pad+cache   0.019   0.019     10
Goedeker; R2C; zero-pad+cache   0.044   0.044     10
 ==== fourdp_cplx in-place (GW code) ====
                Library        CPU-time  WALL-time  ncalls
    FFTW; R2C; zero-pad+cache   0.019   0.019     10
Goedeker; R2C; zero-pad+cache   0.059   0.059     10
 ==== fourdp_cplx out-of-place (GW code) ====
                Library        CPU-time  WALL-time  ncalls
    FFTW; R2C; zero-pad+cache   0.015   0.015     10
Goedeker; R2C; zero-pad+cache   0.058   0.058     10
 ==== fourwf with cplex=2 and option=0 ====
                Library        CPU-time  WALL-time  ncalls
    FFTW; R2C; zero-pad+cache   0.017   0.017     10
Goedeker; R2C; zero-pad+cache   0.022   0.022     10
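
To make the comparison in the output above easier to read, here is a short Python sketch that turns the wall times into speedup factors (Goedecker time divided by FFTW time, so values below 1 mean FFTW is slower). The numbers are copied from the fftprof output; the variable names are illustrative only.

```python
# (FFTW wall time, Goedecker wall time) in seconds for 10 calls each,
# copied from the fftprof output above.
wall_times = {
    "fourdp with cplex 1": (0.072, 0.016),
    "fourdp with cplex 2": (0.019, 0.044),
    "fourdp_cplx in-place (GW code)": (0.019, 0.059),
    "fourdp_cplx out-of-place (GW code)": (0.015, 0.058),
    "fourwf with cplex=2 and option=0": (0.017, 0.022),
}

# Speedup of FFTW over Goedecker; below 1.0 means FFTW is slower.
speedup = {name: goed / fftw for name, (fftw, goed) in wall_times.items()}

slower_with_fftw = [name for name, s in speedup.items() if s < 1.0]

for name, s in speedup.items():
    print(f"{name:36s} {s:.2f}")
print("slower with FFTW:", slower_with_fftw)
```

Only "fourdp with cplex 1" comes out below 1, where FFTW takes about 4.5 times as long as the Goedecker routine.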

ABINIT Discussion Forums
