Scaling up - mpirun, mkmem, k-points

paduano
Posts: 7
Joined: Sat Apr 20, 2013 2:43 am

Post by paduano » Wed May 08, 2013 1:59 am

hi,

I am working on an NV-diamond problem and I am trying to scale up
the number of k-points in my mesh to resolve fine structure in the DoS.
I am encountering two unexpected problems:

1. When I increase the number of k-points, I increase the memory
footprint of a single processor job. This was unexpected; I
thought the solution of the problem at each k-point was
independent of the other k-points, that Etotal was an average over
k-points (via wtk) of the occupied bands/states.

I see from reading docs for mkmem that by default all nkpt ground
state wavefn's are kept in memory. Why is more than one at a time
needed? When I change this setting to 0, it impacts memory footprint
dramatically, but not run time. Why is the default nkpt and not 0?


2. When I use "mpirun -np X" and try to take advantage of cores on a single
box, I cannot come close to the ratios from the plot in the tutorial. I
see small speedup from X=1 to X=2 (1.5-1.7) and almost no speedup from
X=2 to X=4. There is implied overhead scaling up with X; I expected the
work to be (much) more independent and the plots in the tutorial suggest
this should be so. Why might my jobs be experiencing so much more
overhead than expected? My system is 99.9% idle except for these jobs.


matt

gmatteo
Posts: 291
Joined: Sun Aug 16, 2009 5:40 pm

Re: Scaling up - mpirun, mkmem, k-points

Post by gmatteo » Sat May 11, 2013 3:53 pm

> 1. When I increase the number of k-points, I increase the memory
> footprint of a single processor job. This was unexpected; I
> thought the solution of the problem at each k-point was
> independent of the other k-points, that Etotal was an average over
> k-points (via wtk) of the occupied bands/states.


Why unexpected? The wavefunctions for all the k-points are stored in memory during the SCF cycle to avoid
disk I/O, which would hurt efficiency.
The MPI version distributes the k-points among the processes (and also bands and G-vectors if paral_kgb is used),
so for a fixed problem size the memory per process decreases as the number of MPI processes grows.
Conversely, the memory footprint obviously increases if you increase the number of k-points while keeping the number of MPI processes constant.
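
As a rough illustration of the scaling described above, the sketch below estimates the wavefunction storage held by one MPI process. The formula (16 bytes per double-precision complex coefficient, k-points dealt out evenly across processes) and every number in it are illustrative assumptions; they are not taken from abinit's own memory report, and the function name is hypothetical.

```python
import math

BYTES_PER_COEFF = 16  # assumption: one double-precision complex number


def wf_memory_per_process(nkpt, nband, npw, nproc=1, nspinor=1):
    """Rough wavefunction storage (bytes) held by the most loaded MPI process.

    Assumes k-points are dealt out as evenly as possible, so the busiest
    process holds ceil(nkpt / nproc) of them.
    """
    kpts_here = math.ceil(nkpt / nproc)
    return kpts_here * nband * npw * nspinor * BYTES_PER_COEFF


# Hypothetical problem size, chosen only to show the trend.
nband, npw = 150, 20000
for nkpt in (10, 100, 1000):
    for nproc in (1, 4):
        gib = wf_memory_per_process(nkpt, nband, npw, nproc) / 2**30
        print(f"nkpt={nkpt:5d} nproc={nproc}: ~{gib:.2f} GiB per process")
```

Under these assumptions the estimate grows linearly with nkpt at fixed nproc and shrinks roughly as 1/nproc at fixed nkpt, which is the behaviour described in this reply.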


> Why is more than one at a time needed? When I change this setting to 0, it impacts memory footprint
> dramatically, but not run time. Why is the default nkpt and not 0?


We do that to avoid having to read the block of wavefunctions at each step of the SCF cycle.
Sorry, but I cannot believe that mkmem=0 does not have any impact on the run time. Could you provide the input file of the run?


> 2. When I use "mpirun -np X" and try to take advantage of cores on a single
> box, I cannot come close to the ratios from the plot in the tutorial. I
> see small speedup from X=1 to X=2 (1.5-1.7) and almost no speedup from
> X=2 to X=4.


How many (physical) cores do you have in your machine? Configuration details?

paduano
Posts: 7
Joined: Sat Apr 20, 2013 2:43 am

Re: Scaling up - mpirun, mkmem, k-points

Post by paduano » Tue May 14, 2013 11:51 pm

hi,

Thanks for your reply.

One point that wasn't clear from my original post: I happen to be
computing the bands for a given input density file (i.e. iscf=-2)
because I am trying to scale nkpt.

For true SCF calcs, I see why one needs to know the (new) total density
at each iteration before continuing for a given k-point. I am not sure
why the *density* would be different from k-point to k-point in a given
band. But I suppose it could be (I expect they are at least very similar
in practice)...

> Why unexpected?

Because I can compute the bands/spectra one k-point at a time and I get
exactly the same band structure in exactly the same total CPU time (at
least with iscf=-2). Yet, either in memory or on disk, abinit tracks a large
WF structure that grows with nkpt (i.e. mkmem). I can beat this by asking
abinit to compute one k-point at a time and then recombine the bands
myself at the end. Otherwise, I'd never be able to set nkpt to, say, 50,000.

So I wonder why abinit seems to make me do it this way. I've done
example calcs for the 63 atom NV diamond defect and the results are
*exactly* the same whether I run abinit once with nkpt=1000 or whether
I break those 1000 kpts into subsets and recombine myself... I've tried
various ways of choosing the subsets, both in terms of size and in terms
of BZ sampling. All lead to same DoS/bands... I haven't tried this with
a full SCF calc.

[ an annoying aside related to this manual procedure is that abinit won't
read in the wtk vector when I use kptopt=0. abinit writes out the EIG file
with all the wtk values as 1.0 and I have to track the weights separately.
This fact alone, that the results are independent of wtk, is another way of
saying that these calcs are independent from k-point to k-point. Hence,
why drag around some big structure and not just compute them one at a
time and move on? ]
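
The manual split-and-recombine procedure described above could be sketched as follows. This is a minimal illustration, assuming the eigenvalues and the wtk weights of each subset have already been extracted into plain Python lists. The function names, the Gaussian broadening, and the toy numbers are all hypothetical and are not part of abinit.

```python
import math


def gaussian_dos(energies, eigs_per_kpt, weights, sigma=0.05):
    """Weighted, Gaussian-broadened DoS on the given energy grid.

    eigs_per_kpt[i] holds the band energies at k-point i and weights[i]
    is that k-point's weight (wtk), tracked by hand.
    """
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    dos = [0.0] * len(energies)
    for eigs, w in zip(eigs_per_kpt, weights):
        for e_band in eigs:
            for i, e in enumerate(energies):
                dos[i] += w * norm * math.exp(-0.5 * ((e - e_band) / sigma) ** 2)
    return dos


def combine(partial_dos_list):
    """Sum DoS contributions computed from separate k-point subsets."""
    return [sum(vals) for vals in zip(*partial_dos_list)]


# Toy data: 4 k-points with 2 bands each, weights summing to 1.
eigs = [[-0.30, 0.10], [-0.28, 0.12], [-0.25, 0.15], [-0.20, 0.20]]
wtk = [0.125, 0.375, 0.375, 0.125]
grid = [-0.5 + 0.01 * i for i in range(101)]

full = gaussian_dos(grid, eigs, wtk)
# Same k-points processed as two independent "runs", then recombined.
split = combine([
    gaussian_dos(grid, eigs[:2], wtk[:2]),
    gaussian_dos(grid, eigs[2:], wtk[2:]),
])
print(max(abs(a - b) for a, b in zip(full, split)))  # difference at rounding level
```

Because each k-point contributes an independent weighted term, the recombined result matches the single-run result up to floating-point rounding, provided the weights are normalized over the full k-point set and not within each subset.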


> cannot believe that mkmem=0 does not have any impact on the run-time.

You are quite correct! The small calc I did when writing the post had a
small nkpt, and the wavefunction file appears to have been efficiently cached by the
kernel's disk cache on my Linux box. When I let the file grow
beyond the size of main memory, things gum up rapidly, as expected.

I agree that the default value of mkmem makes sense: one wants to
know when one is setting the "go really really slow" flag.


> How many (physical) cores do you have in your machine? Configuration details?

I think I've mostly figured this part out. On my laptop (Intel Core i5), it appears
that the core pipelines are not fully independent for the abinit workload. For two
parallel jobs, I get a 1.7x speedup. For four parallel jobs, I get something like 1.9x
(in 4x the CPU time!). I'm not sure what instructions/processing-units are blocking
my i5, but when I run the same jobs sequentially the CPU times add up. I.e.,
with a cluster of independent machines or with a better CPU (e.g. Xeon) with
fully redundant cores I see 2x, 4x etc. I have seen this with other code I write,
but only with 4 jobs, never with just 2 before... I thought the i5 had at least
2 independent cores that were each hyper-threaded, but I now have proof otherwise.
One must pay some mind to all the marketing BS from Intel!
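
To put the numbers above in perspective, parallel efficiency is the speedup divided by the number of processes. The short sketch below only does that arithmetic on the speedups quoted in this post; the function name is illustrative.

```python
def parallel_efficiency(speedup, nproc):
    """Fraction of ideal linear scaling achieved with nproc processes."""
    return speedup / nproc


# Speedups quoted in the post above (laptop Core i5).
for nproc, speedup in ((2, 1.7), (4, 1.9)):
    eff = parallel_efficiency(speedup, nproc)
    print(f"np={nproc}: speedup {speedup}x, efficiency {eff:.3f}")
```

With these figures the efficiency is 0.85 at two processes and just under 0.5 at four, whereas the Xeon and cluster runs mentioned above (2x, 4x) correspond to an efficiency of about 1.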


matt
