Hi everyone,
I am planning to do some parallel response function calculations (for the phonon dispersions) on a large(ish) system of 80 atoms (352 bands) with a 5x3x2 k-point grid (30 k-points when kptopt = 3) and a 25 Ha cutoff, using PAWs and the LDA.
The most time-consuming part of this calculation is the finite-q part, i.e.:
getwfk 1 # Use GS wave functions from dataset1
kptopt 3 # Need full k-point set for finite-Q response
rfphon 1 # Do phonon response
rfatpol 1 80 # Treat displacements of all atoms
rfdir 1 1 1 # Do all directions (symmetry will be used)
tolvrs 1.0d-12 # This default is active for sets 3-10
I want to work out how to distribute the load for this efficiently over k-points, the FFT grid and bands. The problem is that paral_kgb doesn't work here, and any value other than 1 for npfft or npband is reset to 1 when the calculation starts, i.e.:
--- !WARNING
src_file: m_mpi_setup.F90
src_line: 267
message: |
For non ground state calculation, set bandpp, npfft, npband, npspinor npkpt and nphf to 1
...
I have read that only k-point parallelisation works here. However, the Abinit website (https://docs.abinit.org/topics/parallelism/) reports otherwise, saying:
For response calculations, the code has been parallelized (MPI-based parallelism) on k-points, spins, bands, as well as on perturbations. For the k-points, spins and bands parallelisation, the communication load is rather small also, and, unlike for the GS calculations, the number of nodes that can be used in parallel will be large, nearly independently of the physics of the problem. Parallelism on perturbations is very similar to the parallelism on images in the ground state case (so, very efficient), although the load balancing problem for perturbations with different number of k points is not adressed at present. Use of MPIIO is mandatory for the largest speed ups to be observed.
I then have three questions:
1) How does parallelism work in a phonon calculation?
2) How do I best set the number of processors for such a calculation (according to the number of bands, k-points and the FFT grid)?
3) Does the hybrid MPI/OpenMP parallelisation help for RF calculations? I ask since the website doesn't mention this (from what I have found, at least).
Best,
Jack
Best parallelism for 1st order response calculations.
Re: Best parallelism for 1st order response calculations.
Dear Jack,
Parallelism of DFPT works on k-points and bands by default, without paral_kgb.
This means that you first spread the k-points over the CPUs and, if you can use more CPUs, you then start to spread over bands. When I say "spread" I mean that you just have to choose the number of CPUs in your mpirun command and Abinit will handle the parallelism.
For example, if you have 40 k-points, you can parallelize the calculation over k-points on up to 40 CPUs. Then, for each k-point, you can parallelize over bands, meaning that you can split the band calculation over, e.g., 2 CPUs per k-point, which makes a job of 40x2 = 80 CPUs; the run should (ideally) be close to 2 times faster than on 40 CPUs with k-points only. And then you can go on: 40x4 = 160 CPUs, 40x6 = 240 CPUs, 40x8 = 320 CPUs. The speedup will not be ideal; it depends on how well optimized the compilation is on your machine, how good the communication between the CPUs is, etc. Do a small speedup test if you want to know how good it is in your case and what the optimal number of CPUs is.
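The counting in the example above can be sketched in a few lines of Python. This is only an illustration and not something Abinit provides: the function names `candidate_proc_counts` and `parallel_efficiency` and the default list of band factors are made up for this sketch, and the efficiency formula is the usual ratio of CPU time used in a reference run to CPU time used in a larger run.

```python
def candidate_proc_counts(nkpt, band_factors=(1, 2, 4, 6, 8)):
    """List (nproc, B) pairs with nproc = nkpt * B.

    nkpt is the number of k-points and B the number of processes
    sharing the bands of one k-point (hypothetical helper).
    """
    return [(nkpt * b, b) for b in band_factors]


def parallel_efficiency(t_ref, nproc_ref, t, nproc):
    """Efficiency of a run (wall time t on nproc processes) relative to
    a reference run (wall time t_ref on nproc_ref processes).

    1.0 means ideal scaling; lower values mean time lost to
    communication, load imbalance, etc.
    """
    return (t_ref * nproc_ref) / (t * nproc)


# Process counts for the 40 k-point example discussed above.
for nproc, b in candidate_proc_counts(40):
    print(f"{nproc:4d} MPI processes = 40 k-points x {b} processes per k-point")
```

For the speedup test, one would time the same perturbation at a few of these process counts and feed the measured wall times to `parallel_efficiency`, stopping at the point where the efficiency is no longer acceptable.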
There is no OpenMP for DFPT yet; it is ongoing work!
Best wishes,
Eric
Re: Best parallelism for 1st order response calculations.
Hi Eric,
Thanks a lot. This now makes a lot of sense. Just to clarify though:
1) For best performance, I just need to set Nproc = Nkpt*B, where B is an integer, and find the value of B with the most efficient speed-up (which will drop off at large B)?
2) Does this value of B in any way need to line up with (i.e., be a factor of) the total number of bands? I'm guessing not!
3) Since (I guess) a lot of the performance drop-off at large B comes from bottlenecks in MPI communication, would I be better off (in the absence of OpenMP threading) under-occupying nodes to avoid saturating the MPI channels?
Thanks again for your help!
Jack
Re: Best parallelism for 1st order response calculations.
Eric wrote:
"Parallelism of DFPT works on k-points and bands by default, without parallel_kgb.
This means that you have to first spread the k-points on CPU and if you can put more start to spread over bands. When I say "spread" I mean that you just have to choose the number of CPU in your mpirun calculation and Abinit will handle the parallelism."
The number of points in the IBZ(q, idir, ipert) (let's call it nk_pertcase) is reported in this section of the main output file:
Perturbation wavevector (in red.coord.) 0.000000 0.000000 0.000000
Perturbation : displacement of atom 1 along direction 1
Found 2 symmetries that leave the perturbation invariant.
symkpt : the number of k-points, thanks to the symmetries,
is reduced to 72 .
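As a rough sketch, the k-point counts can be pulled out of the output text with a regular expression. The helper name `irreducible_kpoint_counts` is hypothetical, and the pattern simply assumes the two-line "symkpt ... is reduced to N" message has the layout shown in the excerpt above; the same message may also be printed in other parts of the run (for instance the ground-state dataset), so the matches should be checked against the perturbation headers.

```python
import re

# Matches the "symkpt" message and captures the reduced k-point count.
# Assumes the layout shown in the excerpt above.
_SYMKPT = re.compile(r"symkpt\s*:[^\n]*\s+is\s+reduced\s+to\s+(\d+)")


def irreducible_kpoint_counts(output_text):
    """Return every 'is reduced to N' k-point count found in the text."""
    return [int(n) for n in _SYMKPT.findall(output_text)]


sample = """\
Perturbation wavevector (in red.coord.) 0.000000 0.000000 0.000000
Perturbation : displacement of atom 1 along direction 1
Found 2 symmetries that leave the perturbation invariant.
symkpt : the number of k-points, thanks to the symmetries,
is reduced to 72 .
"""

print(irreducible_kpoint_counts(sample))  # -> [72]
```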
There are two points worth considering:
1) If you run all the perturbations in a single input file, it's almost impossible to find an optimal number of MPI procs, as each perturbation has its own irreducible wedge. In principle, one can use the parallelism over perturbations (https://docs.abinit.org/variables/paral/#paral_rf). This technique is handy, as everything can be done with a single input file, but I'm not a big fan of this approach: different perturbations may require different numbers of iterations to converge, so you will get load imbalance. Last but not least, some perturbations may not converge. In this case, the code will stop and you won't get any result.
2) Not all the data structures in the DFPT code scale in memory with the number of MPI procs. At a certain point, one hits the MPI bottleneck that prevents you from running with all the procs of the compute node. In this case, one should consider OpenMP threads (see my comments below).
Jack wrote: "3) Does the hybrid MPI/openMP parallelisation help for RF calculations? I ask since it doesn't mention this on the website (from what I have found, at least)."
OpenMP may help mitigate the MPI bottleneck. The DFPT code is not optimized for OpenMP, in the sense that most of the high-level loops are parallelized with MPI. Still, one can use OpenMP at the level of the FFT, BLAS/LAPACK and the non-local part.
Obviously one should not expect the same scalability as in MPI but 2-4 threads may be beneficial if you are dealing with large systems as this hybrid MPI-OpenMP approach allows one to use all the CPUs on the nodes.
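The bookkeeping behind "use all the CPUs on the nodes" with a few threads per MPI process can be sketched as below. The function name `ranks_per_node` and the 32-core node are assumptions made for this illustration; the only standard piece is that the thread count per process is normally set through the OpenMP environment variable OMP_NUM_THREADS.

```python
def ranks_per_node(cores_per_node, omp_threads):
    """MPI processes to place on one node so that
    ranks * omp_threads == cores_per_node (hypothetical helper)."""
    if omp_threads < 1 or cores_per_node % omp_threads != 0:
        raise ValueError("omp_threads must divide cores_per_node")
    return cores_per_node // omp_threads


# Example for an assumed 32-core node and the 2-4 threads mentioned above.
for threads in (1, 2, 4):
    ranks = ranks_per_node(32, threads)
    print(f"OMP_NUM_THREADS={threads}: {ranks} MPI processes per node")
```

With fewer MPI processes per node, the data structures that are not distributed are replicated fewer times, while the threads keep the remaining cores busy.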
If I remember correctly, one of the bottlenecks of DFPT is the routine that orthogonalizes the trial first-order wavefunction with respect to the nband GS states. This step is performed with BLAS2 routines and can benefit from OpenMP threads (provided one uses a threaded BLAS library).
We recently added a new tutorial that explains how to activate support for OpenMP with Intel and MKL (https://docs.abinit.org/tutorial/compil ... nd-modules).
Hope it helps.