Reducing memory needs

Total energy, geometry optimization, DFT+U, spin....

Moderator: bguster

ruslan
Posts: 7
Joined: Tue Sep 07, 2010 5:19 pm

Reducing memory needs

Post by ruslan » Mon Oct 04, 2010 10:40 am

Hello, abinit users!

I am trying to calculate phonons using the PAW formalism and abinit. Because this is not implemented [yet], I am falling back to the phonopy code http://phonopy.sourceforge.net/ . The method is relatively straightforward: the program generates supercells with displaced atom positions and takes the calculated forces as input to produce phonon spectra.

My problem arises when I try to simulate a 3x3x3 supercell with 8 atoms in each unit cell. This results in roughly 1440 bands, and I was unable to get such a system to run due to out-of-memory errors. How can one reduce the memory demands of the simulation? The things I tried:
  • Just increasing the number of processors does not help, because the total number of k-points with ngkpt=4x4x4 is 16 and most of the processors are idle.
  • Activating the paral_kgb option. Apart from the standard settings wfoptalg 4, nloalg 4, fftalg 401, intxc 0 and fft_opt_lob 2, I tried enabling parallelization over k-points (npkpt 16, npband 2, npfft 1, bandpp 1); the full set of settings is written out as an input fragment after this list. I tried various combinations of options, but I am missing some hint like "the memory used by each processor gets divided by npband (npfft or something else)". Is it at all possible to reduce memory by parallelization?
  • Activating mkmem=0. Some data gets written to disk, but I am unsure how much slower the simulation gets (can one provide a rough estimate of the slowdown factor?). The problem in this case is that I cannot use band/FFT parallelization, and the simulation lasts for ages.
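
For reference, here is the same attempted distribution written out as an input fragment (the variable names and values are exactly those quoted above; this is only a restatement of the attempt, not a recommended setting):

Code: Select all

paral_kgb 1
wfoptalg 4
nloalg 4
fftalg 401
intxc 0
fft_opt_lob 2
npkpt 16    # one group per irreducible k-point (16 in total)
npband 2    # split the bands over 2 processors within each k-point group
npfft 1     # no plane-wave/FFT distribution
bandpp 1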

I am using a Cray XT6 machine with 24 cores and 32 GB of memory per node. I tried reducing the number of MPI tasks per node, with little success. With 3 MPI tasks per node I could start the simulation, but this seems to be a very inefficient way of operating a parallel machine.

Kind regards,
Ruslan Zinetullin

mverstra
Posts: 655
Joined: Wed Aug 19, 2009 12:01 pm

Re: Reducing memory needs

Post by mverstra » Mon Oct 11, 2010 12:08 pm

ruslan wrote:Hello, abinit users!

I am trying to calculate phonons using the PAW formalism and abinit. Because this is not implemented [yet], I am falling back to the phonopy code http://phonopy.sourceforge.net/ . The method is relatively straightforward: the program generates supercells with displaced atom positions and takes the calculated forces as input to produce phonon spectra.

My problem arises when I try to simulate a 3x3x3 supercell with 8 atoms in each unit cell. This results in roughly 1440 bands, and I was unable to get such a system to run due to out-of-memory errors. How can one reduce the memory demands of the simulation? The things I tried:
  • Just increasing the number of processors does not help, because the total number of k-points with ngkpt=4x4x4 is 16 and most of the processors are idle.

4x4x4 = 64 - how many processors do you want to use? In most cases as you displace atoms you will break symmetry and use all of the k-points (of course nkpt will change for each displacement case...)

  • Activating the paral_kgb option. Apart from the standard settings wfoptalg 4, nloalg 4, fftalg 401, intxc 0 and fft_opt_lob 2, I tried enabling parallelization over k-points (npkpt 16, npband 2, npfft 1, bandpp 1). I tried various combinations of options, but I am missing some hint like "the memory used by each processor gets divided by npband (npfft or something else)". Is it at all possible to reduce memory by parallelization?

Yes, search the forum for paral_kgb and check the web site and the variable definitions. Parallelizing over g-vectors is even more efficient in reducing memory, but it can be much worse for parallel scaling if you have a slow network (ethernet).
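
As an illustration of that suggestion (the values below are purely illustrative, not a tested recommendation for this system), shifting part of the distribution onto the g-vectors/FFTs means raising npfft, for example:

Code: Select all

paral_kgb 1
npkpt 16
npband 2
npfft 2     # also distribute the plane-wave (g-vector) / FFT data, which is what reduces memory per task
bandpp 1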

  • Activating mkmem=0. Some data gets written to disk, but I am unsure how much slower the simulation gets (can one provide a rough estimate of the slowdown factor?). The problem in this case is that I cannot use band/FFT parallelization, and the simulation lasts for ages.

It will get slower. The factor will depend on your disk I/O, so there is no rule (some systems are fine with this mode). And indeed there are a number of options you cannot use with mkmem 0 - there is no guarantee that all processors will have access to the scratch file, and some combinations are just not coded.



    I am using a Cray XT6 machine with 24 cores and 32 GB of memory per node. I tried reducing the number of MPI tasks per node, with little success. With 3 MPI tasks per node I could start the simulation, but this seems to be a very inefficient way of operating a parallel machine.

    Kind regards,
    Ruslan Zinetullin

Matthieu Verstraete
University of Liege, Belgium

    ruslan
    Posts: 7
    Joined: Tue Sep 07, 2010 5:19 pm

    Re: Reducing memory needs

    Post by ruslan » Wed Oct 13, 2010 12:33 pm

    mverstra wrote:4x4x4 = 64 - how many processors do you want to use? In most cases as you displace atoms you will break symmetry and use all of the k-points (of course nkpt will change for each displacement case...)

    Let's consider the following example: a 3x3x3 supercell of Nb3Sn. First, I tried to make a run with ideal atomic positions and then load this wavefunction into the perturbed simulations to ease convergence. I have the following setup:

    Code: Select all

    acell 29.6035314709658941 29.6035314709658941 29.6035314709658941
    ntypat 2
    znucl 41 50
    natom 216
    typat 162*1 54*2
    nband 1440

    ecut 20
    pawecutdg 20
    kptopt 1
    ngkpt 4 4 4

    paral_kgb 1
    wfoptalg 4
    nloalg=4
    fftalg=401
    intxc=0
    fft_opt_lob=2

    npkpt 4
    npband 12
    npfft  2
    bandpp  4

    prtposcar 1

    chkprim 0
    maxnsym 1296 # supercell has more symmetries

    prtden 1
    prtwf 1
    prteig 0
    tolvrs 1.0d-16
    diemac 1000000
    xred
    ... positions here

    This case results in nkpt=4, which is fairly small. Memory needs seem to be within limits:
    P This job should need less than 1057.967 Mbytes of memory.
    Rough estimation (10% accuracy) of disk space for files :
    WF disk file : 811.760 Mbytes ; DEN or POT disk file : 6.594 Mbytes.

    Biggest array : pawfgrtab%gylm(gr), with 302.9008 MBytes.


    But my problem is that I cannot run even such a simple system on 96 processors! Log:
    ITER STEP NUMBER 1
    vtorho : nnsclo_now= 2, note that nnsclo,dbl_nnsclo,istep= 0 0 1
    starting lobpcg, with nblockbd,mpi_enreg%nproc_band 30 12
    [NID 00040] 2010-10-13 01:01:40 Apid 23269: initiated application termination
    [NID 00040] 2010-10-13 01:01:48 Apid 23269: OOM killer terminated this process.
    Application 23269 exit signals: Killed
    Application 23269 resources: utime 0, stime 0


    Each node has 32 GB of memory, which seems to be enough for this job. I ran abinit with paral_kgb=-96 and got:

    Code: Select all

      nproc    npkpt    npband    npfft    bandpp    weight
         96        4         6        4         4      0.25
         96        4         8        3         4      0.50
         96        4        12        2         4      1.50
         96        4        24        1         4      1.00


    Obviously it makes no sense to increase npkpt, because there are only 4 k-points. Which parameter could be adjusted (if any) to get the simulation to run? I have already reduced ecut/pawecutdg to the rather small value of 20.
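
    For concreteness, the table entry with the largest npfft (which, following the advice above, should presumably be the most memory-friendly distribution, although it has the lowest predicted weight) would correspond to an input fragment roughly like this:

    Code: Select all

    paral_kgb 1
    npkpt 4
    npband 6
    npfft 4     # more g-vector/FFT distribution, less band distribution
    bandpp 4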

    Yes, search the forum for paral_kgb and check the web site and the variable definitions. Parallelizing over g-vectors is even more efficient in reducing memory, but it can be much worse for parallel scaling if you have a slow network (ethernet).


    What do you mean by parallelizing over g-vectors? npkpt? As you can see, I set the maximal allowable value of npkpt and it does not really help.

    Kind regards,
    Ruslan Zinetullin
